sequences is disclosed. The apparatus comprises a perceptual
metric generator having an input signal processing section, a
luminance processing section, a chrominance
processing section and a perceptual metric generating section.
The luminance processing section simultaneously processes
at least two image fields, so as to provide spatio-temporal
channels whose calibration is independent of pure-

METHOD AND APPARATUS FOR
ASSESSING THE VISIBILITY OF 
DIFFERENCES BETWEEN TWO SIGNAL 
SEQUENCES 

This application claims the benefit of U.S. Provisional 
Application No. 60/121,543 filed on Feb. 25, 1999, which is 
herein incorporated by reference. This application is also a 
continuation-in-part application of U.S. patent application 
Ser. No. 09/055,076 filed Apr. 3, 1998, which claims the 
benefit of U.S. Provisional Applications No. 60/043,050 
filed Apr. 4, 1997, and No. 60/073,435 filed Feb. 2, 1998, 
which are herein incorporated by reference. 

The present invention relates to an apparatus and con- 
comitant method for evaluating and improving the perfor- 
mance of signal processing systems. More particularly, this 
invention relates to a method and apparatus that assesses the 
visibility of differences between two signal sequences. 

BACKGROUND OF THE INVENTION 

Designers of signal processing systems, e.g., imaging 
systems, often assess the performance of their designs in 
terms of physical parameters such as contrast, resolution 
and/or bit-rate efficiency in compression/decompression 
(codec) processes. While these parameters can be easily 
measured, they may not be accurate gauges for evaluating 
performance. The reason is that end users of imaging 
systems are generally more concerned with the subjective
visual performance such as the visibility of artifacts or
distortions and in some cases, the enhancement of these
image features which may reveal information such as the
existence of a tumor in an image, e.g., a MRI (Magnetic
Resonance Imaging) image or a CAT (Computer-Assisted
Tomography) scan image.

For example, an input image can be processed using two 
different codec algorithms to produce two different codec 
images. If the measure of codec image fidelity is based 
purely on parameters such as performing mean squared error 
(MSE) calculations on both codec images without considering
the psychophysical properties of human vision, the
codec image with a lower MSE value may actually contain
more noticeable distortions than that of a codec image with
a higher MSE value.
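
The limitation of MSE described above can be illustrated with a minimal sketch. The two distorted rows below are hypothetical test data (not taken from this disclosure), constructed so that a faint uniform offset and a single localized error yield the same MSE even though an observer may not find them equally visible.

```python
def mse(ref, test):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(ref) != len(test):
        raise ValueError("sequences must have the same length")
    return sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)


# Hypothetical 16-pixel reference row (flat mid-gray).
reference = [128] * 16

# Distortion A: every pixel shifted by 2 gray levels (faint uniform offset).
uniform_shift = [p + 2 for p in reference]

# Distortion B: one pixel off by 8 gray levels (localized artifact).
single_spike = list(reference)
single_spike[7] += 8

mse_a = mse(reference, uniform_shift)  # 16 * 2**2 / 16
mse_b = mse(reference, single_spike)   # 8**2 / 16
```

Both distortions score identically under MSE, which is why a purely physical error measure cannot rank them by visibility.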

Therefore, a need exists in the art for a method and 
apparatus for assessing the effects of physical parameters on
the subjective performance of a signal processing system,
e.g., an imaging system. Specifically, a need exists for a
method and apparatus for assessing the visibility of differences
between two sequences of time-varying visual images.

SUMMARY OF THE INVENTION 

The present invention is a method and apparatus for 
assessing the visibility of differences between two input 
signal sequences, e.g., image sequences. The apparatus 
comprises a perceptual metric generator having an input
signal processing section, a luminance processing section, a
chrominance processing section and a perceptual metric
generating section.

The input signal processing section transforms input 
signals into psychophysically defined quantities, e.g., luminance
components and chrominance components. The luminance
processing section processes the luminance components
into a luminance perceptual metric, while the
chrominance processing section processes the chrominance
components into a chrominance perceptual metric. Finally,
the perceptual metric generating section correlates the luminance
perceptual metric with the chrominance perceptual
metric into a unified perceptual image metric, e.g., a
just-noticeable-difference (JND) map.

The JND map is produced using independent spatial and
temporal channels to process the input signals. To enhance
the performance of the apparatus, channels having
spatio-temporal filters are used to respond to point or line flicker
within the signals. Specifically, filtering is performed over
multiple image fields to simulate visual response to line
flicker without altering response to pure spatial or temporal
signals.

BRIEF DESCRIPTION OF THE DRAWINGS 

The teachings of the present invention can be readily
understood by considering the following detailed descrip- 
tion in conjunction with the accompanying drawings, in 
which: 

FIG. 1 illustrates a block diagram of a signal processing
system of the present invention;

FIG. 2 illustrates a block diagram of the perceptual metric
generator;

FIG. 3 illustrates a block diagram of the input signal
processing section;

FIG. 4 illustrates a block diagram of the luminance
processing section;

FIG. 5 illustrates a block diagram of the chrominance
processing section;

FIG. 6 illustrates a detailed block diagram of the luminance
processing section;

FIG. 7 illustrates a block diagram of the luminance metric
generating section;

FIG. 8 illustrates a detailed block diagram of the chrominance
processing section;

FIG. 9 illustrates a block diagram of the chrominance 
metric generating section; 

FIG. 10 is a graph illustrating Luminance Spatial Sensitivity
data;

FIG. 11 is a graph illustrating Luminance Temporal
Sensitivity data;

FIG. 12 is a graph illustrating Luminance Contrast
Discrimination data;

FIG. 13 is a graph illustrating Disk Detection data;

FIG. 14 is a graph illustrating Checkerboard Detection
data;

FIG. 15 is a graph illustrating Edge Sharpness Discrimination
data;

FIG. 16 is a graph illustrating Chrominance Spatial
Sensitivity data;

FIG. 17 is a graph illustrating Chrominance Contrast
Discrimination data;

FIG. 18 is a graph illustrating Rating Predictions data;

FIG. 19 illustrates a block diagram of an alternate 
embodiment of the luminance processing section; 

FIG. 20 illustrates a detailed block diagram of the alter- 
nate embodiment of the luminance processing section of 
FIG. 19; 

FIG. 21 illustrates a detailed block diagram of an alternate 
embodiment of the luminance metric generating section; 

FIG. 22 illustrates a block diagram of a luminance processing
section for processing half-height images;

FIG. 23 illustrates a block diagram of a luminance metric
generating section for processing half-height images;

FIG. 24 illustrates a detailed block diagram of an alternate
embodiment of the chrominance processing section;

FIG. 25 illustrates a detailed block diagram of an alternate
embodiment of the chrominance metric generating section;

FIG. 26 illustrates a block diagram of a chrominance
processing section for processing half-height images;

FIG. 27 illustrates a block diagram of a chrominance
metric generating section for processing half-height images;

FIG. 28 depicts a block diagram of an alternative embodi- 
ment of the luminance processing section. 

DETAILED DESCRIPTION 

FIG. 1 depicts a signal processing system 100 that utilizes 
the present invention. The signal processing system consists 
of a signal receiving section 130, a signal processing section 
110, input/output devices 120 and a system under test 140. 

Signal receiving section 130 serves to receive input data 
signals, such as sequences of images from imaging devices 
or other time-varying signals such as audio signals from 
microphones or recorded media. Thus, although the present 
invention is described below with regard to images, it should 
be understood that the present invention can be applied to 
other input signals as discussed above. 

Signal receiving section 130 includes a data receiving 
section 132 and a data storage section 134. Data receiving
section 132 may include a number of devices such as a
modem and an analog-to-digital converter. A modem is a 
well-known device that comprises a modulator and a 
demodulator for sending and receiving binary data over a 
telephone line or other communication channel, while an 
analog-to-digital converter converts analog signals into a 
digital form. Hence, signal receiving section 130 may 
receive input signals "on-line" or in "real-time" and, if 
necessary, convert them to a digital form. As such, section 
130 may receive signals from one or more devices such as 
a computer, a camera, a video recorder or various medical 
imaging devices. 

The data storage section 134 serves to store input signals 
received by data receiving section 132. Data storage section 
134 contains one or more devices such as a disk drive, 
semiconductor memory or other storage media. These stor- 
age devices provide a method for applying a delay to the 
input signals or to simply store the input signals for subse- 
quent processing. 

In the preferred embodiment, the signal processing sec- 
tion 110 comprises a general purpose computer having a 
perceptual metric generator (otherwise known as a visual
discrimination measure (VDM)) 112, a central processing
unit (CPU) 114 and a memory 116 to facilitate image
processing. The perceptual metric generator 112 can be a
physical apparatus constructed from various filters or a
processor which is coupled to the CPU through a communication
channel. Alternatively, the perceptual metric generator
112 can be implemented as a software application,
which is recalled from an input/output device 120 or from 
the memory 116 and executed by the CPU of the signal 
processing section. As such, the perceptual metric generator 
of the present invention can be stored on a computer 
readable medium. 

The signal processing section 110 is also coupled to a 
plurality of input and output devices 120 such as a keyboard, 
a mouse, a video monitor or storage devices including but 
not limited to magnetic and optical drives, diskettes or tapes, 
e.g., a hard disk drive or a compact disk drive. The input 
devices serve to provide inputs (control signals and data) to 
the signal processing section for processing the input
images, while the output devices serve to display or record
the results, e.g., displaying a perceptual metric on a display.

The signal processing system 100 using the perceptual
metric generator 112 is able to predict the perceptual ratings
that human subjects will assign to two signal sequences, e.g.,
a degraded color-image sequence relative to its non-degraded
counterpart. The perceptual metric generator 112
assesses the visibility of differences between two sequences
or streams of input images and produces several difference
estimates, including a single metric of perceptual differences
between the sequences. These differences are quantified in
units of the modeled human just-noticeable difference (JND)
metric. This metric can be expressed as a JND value, a JND
map or a probability prediction. In turn, the CPU may utilize
the JND image metric to optimize various processes
including, but not limited to, digital image compression,
image quality measurement and target detection.

To illustrate, an input image sequence passes through two
different paths or channels to a signal processing system
100. The input image sequence passes directly to the signal
processing section without processing on one path (the
reference channel or reference image sequence), while the
same input image sequence passes on another path through
a system under test 140, where the image sequence is
processed in some form (the channel under test or test image
sequence). The signal processing system 100 generates a
perceptual metric that measures the differences between the
two image sequences. The distortion generated by the system
under test 140 is often incurred for economic reasons,
e.g., the system under test 140 can be an audio or video
encoder. In fact, the system under test 140 can be any
number of devices or systems, e.g., a decoder, a transmission
channel itself, an audio or video recorder, a scanner, a
display or a transmitter. Thus, signal processing system 100
can be employed to evaluate the subjective quality of a test
image sequence relative to a reference image sequence,
thereby providing information as to the performance of an
encoding process, a decoding process, the distortion of a
transmission channel or any "system under test". Through
the use of the perceptual metric generator 112, evaluation of
the subjective quality of the test image relative to the
reference sequence can be performed without the use of a
human observer.

Finally, the perceptual metric can be used to modify or
control the parameters of a system under test via path 150.
For example, the parameters of an encoder can be modified
to produce an encoded image that has improved perceptual
ratings, e.g., less noticeable distortion when the encoded
image is decoded. Furthermore, although the system under
test 140 is illustrated as a separate device, those skilled in the
art will realize that a system under test can be implemented
as a software implementation residing in the memory 116 of
the signal processing section, e.g., a video encoding method.

FIG. 2 illustrates a simplified block diagram of the 
perceptual metric generator 112. In the preferred 
embodiment, the perceptual metric generator comprises an 
input signal processing section 210, a luminance processing
section 220, a chrominance processing section 230, a luminance
metric generating section 240, a chrominance metric
generating section 250 and a perceptual metric generating 
section 260. 

The input signal processing section transforms input
signals 205 into psychophysically defined quantities, e.g.,
luminance components and chrominance components for
image signals. The input signals are two image sequences of
arbitrary length. Although only one input signal is illustrated
in FIG. 2, it should be understood that the input signal
processing section can process more than one input signal
simultaneously. The purpose of the input signal processing
section 210 is to transform input image signals to light
outputs, and then to transform these light outputs to
psychophysically defined quantities that separately characterize
luminance and chrominance.

More specifically, for each field of each input sequence,
there are three data sets, labeled Y', C'b, and C'r at the top of
FIG. 2, derived, e.g., from a D1 tape. In turn, Y', C'b, C'r data
are then transformed to R', G', and B' electron-gun voltages
that give rise to the displayed pixel values. In the input signal
processing section, R', G', B' voltages undergo further
processing to transform them to a luminance and two chromatic
images that are passed to subsequent processing stages or
sections.

The luminance processing section 220 accepts two images
(test and reference) of luminances Y, expressed as fractions
of the maximum luminance of the display. These outputs are
passed to luminance metric generating section 240, where a
luminance JND map is generated. The JND map is an image
whose gray levels are proportional to the number of JNDs
between the test and reference image at the corresponding
pixel location.

Similarly, the chrominance processing section 230 processes
the chrominance components of the input signals into
a chrominance perceptual metric. Namely, the chrominance
processing section 230 accepts two images (test and
reference) of chrominance based on the CIE L*u*v*
uniform-color space, (occurs for each of the chrominance
images u* and v*), and expressed as fractions of the
maximum chrominance of the display. In turn, outputs of u*
and v* processing are received and combined by the chrominance
metric generating section 250 to produce the chrominance
JND map.

Furthermore, both chrominance and luminance processing
are influenced by inputs from the luminance channel
called "masking" via path 225, which render perceived
differences more or less visible depending on the structure of
the luminance images. Masking (self or cross) generally
refers to a reduction of sensitivity in the presence of information
in a channel or a neighboring channel.

The chrominance, luminance and combined luma-chroma
JND maps are each available as output to the perceptual
metric generating section 260, together with a small number
of summary measures derived from these maps. Whereas the
single JND value (JND summaries) output is useful to model
an observer's overall rating of the distortions in a test
sequence, the JND maps give a more detailed view of the
location and severity of the artifacts. In turn, the perceptual
metric generating section 260 correlates the luminance
perceptual metric with the chrominance perceptual metric into
a unified perceptual image metric 270, e.g., an overall
just-noticeable-difference (JND) map.

It should be noted that two basic assumptions underlie the
present invention. First, each pixel is "square" and subtends
0.03 degrees of viewing angle. This number was derived
from a screen height of 480 pixels, and a viewing distance
of four screen-heights (the closest viewing distance prescribed
by the "Rec. 500" standard). When the present
perceptual metric generator is compared with human perception
at longer viewing distances than four screen heights,
the perceptual metric generator may overestimate the
human's sensitivity to spatial details. Thus, in the absence of
hard constraints on viewing distance, the perceptual metric
generator is adjusted to be as sensitive as possible within the
recommendations of the "Rec. 500" standard. However, the sensitivity
of the perceptual metric generator can be adjusted for a
particular application.

Second, the perceptual metric generator applies to screen
luminances of 0.01-100 ft-L (for which overall sensitivity
was calibrated), but with greatest accuracy at about 20 ft-L
(for which all spatiotemporal frequencies were calibrated). It
is also assumed that changing luminance incurs proportional
sensitivity changes at all spatiotemporal frequencies, and
this assumption is less important near 20 ft-L, where additional
calibration occurred. Calibration and experimental
data are presented below.

The processing sections illustrated in FIG. 2 are now
described in more detail below with reference to FIGS. 3, 4,
5, 6 and 7.

FIG. 3 illustrates a block diagram of the input signal
processing section 210. In the preferred embodiment, each
input signal is processed in a set of four fields 305. Thus, the
stack of four fields labeled Y', C'b, C'r at the top of FIG. 3
indicates a set of four consecutive fields from either a test or
reference image sequence. However, the present invention is
not limited to such implementation and other field grouping
methods can be used.

Multiple transformations are included in the input signal
processing section 210. In brief, the input signal processing
section 210 transforms Y', C'b, C'r video input signals first to
electron-gun voltages, then to luminance values of three
phosphors, and finally into psychophysical variables that
separate into luminance and chrominance components. The
tristimulus value Y, which is computed below, replaces the
"model intensity value" used before chrominance processing.
In addition, chrominance components u* and v* are
generated, pixel by pixel, according to CIE uniform-color
specifications.

It should be noted that the input signal processing section
210 can be implemented optionally, if the input signal is
already in an acceptable uniform-color space. For example,
the input signal may have been previously processed into the
proper format and saved onto a storage device, e.g., magnetic
or optical drives and disks. Furthermore, it should be
noted that although the present invention is implemented
with pixels mapped into CIELUV, an international-standard
uniform-color space, the present invention can be implemented
and adapted to process input signals that are mapped
into other spaces.

The first processing stage 310 transforms Y', C'b, C'r data
to R', G', B' gun voltages. More specifically, the steps
outlined below describe the transformation from Y', C'b, C'r
image frames to R', G', B' voltage signals that drive a CRT
display. Here, the apostrophe indicates that the input signals
have been gamma-precorrected at the encoder. Namely,
these signals, after transformation, can drive a CRT display
device whose voltage-current transfer function can be
closely approximated by a gamma nonlinearity.

It is assumed that the input digital images are in 4:2:2
format: full resolution on the luminance correlate Y', and
half-resolution horizontally for the chrominance correlates
C'b and C'r, where Y', C'b, C'r data are assumed to be stored
in the order specified in ANSI/SMPTE Std. 125M-1992, i.e.,

C'b0, Y'0, C'r0, Y'1, C'b1, Y'2, C'r1, Y'3, ..., C'b(n/2-1), Y'(n-2), C'r(n/2-1), Y'(n-1).

In the steps enumerated below, there are two embodiments
or alternatives for chrominance upsampling and three
embodiments or alternatives for matrix conversion from Y'
C'b C'r to R' G' B'. These alternatives cover various common
requirements, e.g., decoding requirements that might be
encountered in various applications.
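
The 4:2:2 sample ordering assumed for the input can be sketched as follows. This is an illustration only: it assumes the repeating C'b, Y', C'r, Y' multiplex order for one row, and the function name `demux_422` and the sample values are hypothetical.

```python
def demux_422(stream):
    """Split one multiplexed 4:2:2 row (Cb, Y, Cr, Y, ...) into Y, Cb, Cr lists.

    Assumes the row is a whole number of 4-sample groups, each holding one
    Cb/Cr pair that is shared by two consecutive luminance samples.
    """
    if len(stream) % 4 != 0:
        raise ValueError("expected whole Cb, Y, Cr, Y groups")
    y, cb, cr = [], [], []
    for i in range(0, len(stream), 4):
        cb_i, y_even, cr_i, y_odd = stream[i:i + 4]
        cb.append(cb_i)
        cr.append(cr_i)
        y.extend([y_even, y_odd])
    return y, cb, cr


# Hypothetical row with four luminance samples and two chroma pairs.
row = [100, 16, 200, 17, 101, 18, 201, 19]
y_row, cb_row, cr_row = demux_422(row)
```

The result has full-resolution luminance and half-resolution chrominance, which is the starting point for the upsampling alternatives.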

More specifically, in the first chrominance upsampling
embodiment, the Y' C'b C'r arrays from a single frame are
received, where the C'b and C'r arrays are expanded to the
full resolution of the Y' image. The C'b and C'r arrays are
initially at half-resolution horizontally, and are then
up-sampled to create the full-resolution fields. Namely, the
alternate C'b, C'r pixels on a row are assigned to the
even-numbered Y'_i in the data stream. Then, the C'b, C'r pair
associated with each odd-numbered Y'_i is computed either
(i) by replication or (ii) by averaging with its neighbors.
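
A minimal sketch of the two fill-in rules (replication and neighbor averaging) for one chroma row is given below. The function name is hypothetical, and the handling of the last sample, which has no right-hand neighbor, is an assumption that the text does not specify.

```python
def upsample_chroma(half_row, method="replicate"):
    """Expand a half-resolution chroma row to full horizontal resolution.

    Sample k of half_row is placed at full-resolution position 2k.
    Position 2k+1 is filled either by replicating sample k or by averaging
    samples k and k+1. The final position has no right neighbor, so it is
    replicated in both modes (an assumed edge rule).
    """
    if method not in ("replicate", "average"):
        raise ValueError("method must be 'replicate' or 'average'")
    full = []
    for k, value in enumerate(half_row):
        full.append(value)
        if method == "average" and k + 1 < len(half_row):
            full.append((value + half_row[k + 1]) / 2)
        else:
            full.append(value)
    return full


replicated = upsample_chroma([100, 120, 140], "replicate")
averaged = upsample_chroma([100, 120, 140], "average")
```

Replication is cheaper, while averaging avoids the stair-step pattern that replication leaves on smooth chroma ramps.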

In the second chrominance upsampling embodiment, the
full-resolution Y', C'b, C'r arrays are parceled into two fields.
In the case of Y', the first field contains the odd lines of the
Y' array, and the second field contains the even lines of the
Y' array. Identical processing is performed on C'b and C'r
arrays to produce the first and second C'b and C'r fields.
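
The field parceling can be sketched in a few lines. The sketch assumes lines are counted from 1, so the "odd lines" are rows 0, 2, 4, ... of a zero-indexed array; the function name is hypothetical.

```python
def split_fields(frame):
    """Return (first_field, second_field) from a frame given as a list of rows.

    With lines numbered from 1, the first field holds lines 1, 3, 5, ...
    (zero-based rows 0, 2, 4, ...) and the second field holds lines 2, 4, 6, ...
    """
    return frame[0::2], frame[1::2]


# Hypothetical four-line frame; each row is labeled by its 1-based line number.
frame = [["line1"], ["line2"], ["line3"], ["line4"]]
first_field, second_field = split_fields(frame)
```

The same call would be applied to the Y', C'b and C'r arrays in turn.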

In matrix conversion of Y' C'b C'r to R' G' B', the
corresponding Y' C'b C'r values are converted to the gun
input values R', G', B' for each pixel in each of the two fields.
The Y' C'b C'r values are taken to be related to the R' G' B'
values by one of the following three alternative equations.
The first two equations can be found in Video Demystified,
by Keith Jack, HighText, San Diego, 1996 (Ch. 3, pp. 40-42).
Equation (3) corresponds to Equation 9.9 in A Technical
Introduction to Digital Video, by C. A. Poynton, p. 176,
Wiley, 1996. (C'b was substituted for U and C'r was substituted
for V.) In the preferred embodiment, equation (2) is
selected as the default, which should be used unless
measurement of a display of interest indicates otherwise.
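
As an illustration of this kind of matrix conversion, the sketch below uses the commonly quoted ITU-R BT.601 coefficients for 8-bit studio-range Y'CbCr. This choice is an assumption made for the example; it is not asserted to be identical to any of equations (1) through (3), and the function name is hypothetical.

```python
def ycbcr_to_rgb_prime(y, cb, cr):
    """Convert 8-bit studio-range Y'CbCr to gamma-precorrected R'G'B' (0-255).

    Coefficients are the widely used BT.601 studio-range values (assumed
    here for illustration). Results are clamped to the 0-255 range.
    """
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.813 * (cr - 128) - 0.392 * (cb - 128)
    b = 1.164 * (y - 16) + 2.017 * (cb - 128)

    def clamp(v):
        return max(0.0, min(255.0, v))

    return clamp(r), clamp(g), clamp(b)


black = ycbcr_to_rgb_prime(16, 128, 128)   # studio black, neutral chroma
white = ycbcr_to_rgb_prime(235, 128, 128)  # studio white, neutral chroma
```

Neutral chroma (128, 128) yields equal R', G', B' values, so gray inputs stay gray after conversion.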

The R', G', and B' arrays are then received by the second
processing stage 320 in the input signal processing section
210. The second processing stage 320 applies a point-nonlinearity
to each R', G', B' image. This second processing
stage models the transfer of the R', G', B' gun voltages into
intensities (R, G, B) of the display (fractions of maximum
luminance). The nonlinearity also performs clipping at low
luminances in each plane by the display.

More specifically, the conversion between (R', G', B') and
(R, G, B) contains two parts, one of which transforms each
pixel value independently and one of which performs a
spatial filtering on the transformed pixel values. The two
parts are described below.






Pixel-Value Transformation

First, the fraction of maximum luminance R corresponding
to input R' is computed for each pixel. Similarly, the
fractional luminances G and B are computed from inputs G',
B'. The maximum luminance from each gun is assumed to
correspond to the input value 255. The following equations
describe the transformation from (R', G', B') to (R, G, B):



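$$R = \left[\frac{\max(R', t_d)}{255}\right]^{\gamma}, \qquad G = \left[\frac{\max(G', t_d)}{255}\right]^{\gamma}, \qquad B = \left[\frac{\max(B', t_d)}{255}\right]^{\gamma} \qquad (4)$$

The default threshold value t_d is selected to be 16 to
correspond with the black level of the display, and γ defaults
to 2.5.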
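
Equation (4) can be sketched as a single function, using the stated defaults of 16 for the threshold and 2.5 for gamma. The function and variable names are illustrative.

```python
def gun_voltage_to_intensity(v, t_d=16, gamma=2.5):
    """Fraction of maximum luminance for an 8-bit gun input v, per Equation (4).

    Inputs below the threshold t_d are clipped to t_d, which models the
    black level of the display; input 255 maps to full luminance (1.0).
    """
    return (max(v, t_d) / 255.0) ** gamma


full_white = gun_voltage_to_intensity(255)
clipped_black = gun_voltage_to_intensity(0)
dynamic_range = full_white / clipped_black  # ratio of brightest to darkest output
```

Because every input at or below the threshold yields the same output, the ratio of the brightest to the darkest output is fixed by t_d and gamma alone.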

The value of 16 for t_d is selected to provide the display
with a dynamic range of (255/16)^2.5, which is approximately
1000:1. This dynamic range is relatively large, and may not
be necessary where the ambient illumination is approximately
1% of the maximum display white. Therefore, the
physical fidelity can be maintained even if the perceptual
metric generator employs the value 40 as a black level instead of
the value 16, which still provides a 100:1 dynamic range. In
fact, a lower dynamic range will produce a saving in
computational cycles, i.e., saving one or two bits in the
processing.
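
As a quick arithmetic check of the two dynamic-range figures, evaluating the ratio of full white to the clipped black level with a gamma of 2.5 gives:

```latex
\left(\frac{255}{16}\right)^{2.5} \approx 1.01 \times 10^{3},
\qquad
\left(\frac{255}{40}\right)^{2.5} \approx 1.03 \times 10^{2}
```

These agree with the approximate 1000:1 and 100:1 ratios for black levels of 16 and 40, respectively.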

Two observations about the display are discussed below. 
The first observation involves the dependence on absolute 
screen luminance. The predictions of the perceptual metric 
generator implicitly apply only to the luminance levels for 
which the perceptual metric generator was calibrated. 

For typical calibration data (J. J. Koenderink and A. J. van
Doorn, "Spatiotemporal contrast detection threshold surface
is bimodal," Optics Letters 4, 32-34 (1979)), the retinal
illuminance was 200 trolands, using a default pupil of
diameter 2 mm. This implies a screen luminance of 63.66
cd/m^2, or 18.58 ft-L. The calibration luminance is comparable
to the luminances of the displays used in the subjective
rating tests. For example, although the maximum-white
luminances of two experiments were 71 and 97 ft-L, the
luminances at pixel value 128 were 15 and 21 ft-L, respectively.
Taking these values into account and the fact that the
perceptual metric generator's overall sensitivity was calibrated
from 0.01 to 100 ft-L (using data of F. L. van Nes, J.
J. Koenderink, H. Nas, and M. A. Bouman, "Spatiotemporal
modulation transfer in the human eye," J. Opt. Soc. Am. 57,
1082-1088 (1967)), it can be concluded that the perceptual
metric generator applies to screen luminances from approximately
20 to 100 ft-L.
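
The conversion from retinal illuminance to screen luminance quoted above can be reproduced with a short sketch. It relies on the standard definition of the troland (luminance in cd/m^2 multiplied by pupil area in mm^2) and on the conversion 1 ft-L ≈ 3.426 cd/m^2; the function name is illustrative.

```python
import math

CD_PER_M2_PER_FTL = 3.426  # 1 foot-lambert is approximately 3.426 cd/m^2


def luminance_from_trolands(trolands, pupil_diameter_mm):
    """Screen luminance in cd/m^2 from retinal illuminance in trolands.

    Trolands = luminance (cd/m^2) * pupil area (mm^2), so the luminance is
    the retinal illuminance divided by the pupil area.
    """
    pupil_area_mm2 = math.pi * (pupil_diameter_mm / 2.0) ** 2
    return trolands / pupil_area_mm2


luminance_cd_m2 = luminance_from_trolands(200, 2)
luminance_ftl = luminance_cd_m2 / CD_PER_M2_PER_FTL
```

With 200 trolands and a 2 mm pupil this gives about 63.66 cd/m^2, or about 18.58 ft-L, matching the figures in the text.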

The second observation involves the relationship of Equation
(4) to other models. An offset voltage t_d (e.g., from a
grid setting between cathode and TV screen) can be used to
transform Equation (4) into the model advanced by Poynton
(C. A. Poynton, "'Gamma' and its disguises: The nonlinear
mappings of intensity in perception, CRTs, Film, and Video,"
SMPTE Journal, December 1993, pp. 1099-1108) where
R = k[R' + b]^γ (and similarly for G and B). One obtains
Poynton's model by defining a new voltage R'' = R' - t_d. Hence
R = k[R'' + t_d]^γ, and similarly for G and B. By writing Equation
(4) rather than the equation of Poynton, it is assumed that an
offset voltage is -t_d. It is also assumed that there is no
ambient illumination.

In the presence of ambient illumination c, the voltage 
offset becomes negligible, and Equation (4) becomes 
approximately equivalent to the model advanced by Meyer
("The importance of gun balancing in monitor calibration,"
in Perceiving, Measuring, and Using Color (M. Brill, ed.),
Proc. SPIE, Vol. 1250, pp. 69-79 (1990)), namely R = kR'^γ + c.
Similar expressions result for G and B. If ambient illumination
is present, then Equation (4) can be replaced by the
model of Meyer, with k = (1/255)^γ and c = 0.01.

The present perceptual metric generator provides three 
options for specifying the vertical representation of (R, G, 
B) images, for each frame (in progressive images) and for 
odd and even fields (in interlaced images). 

Option 1. Frame 

Images are full-height and contain one progressively 
scanned image. 

Option 2. Full-height Interlace 

Half-height images are interspersed with blank lines and 
become full-height, as they are in an interlaced screen. 
Blank lines are subsequently filled by interpolation as 
described below. 

Option 3. Half-height Interlace 

Half-height images are processed directly. 

The first two options are more faithful to video image 
structure, whereas the third option has the advantage of 
reducing processing time and memory requirements by 
50%. Luminance and chrominance processing are identical 
for options 1 and 2 since both options operate on full-height 
images. These three options are described in detail below. 
Spatial Pre-Filtering 

Spatial pre-processing is not required for the above 
options 1 and 3. However, there is spatial pre-filtering 
associated with the full-height interlace option 2. 

To accommodate the spread of light from line to inter-line 
pixels in a field, the R, G, and B field images are also 
subjected to a line interpolation process. Four different 
methods of interpolation are illustrated below, but the 
present invention is not limited to these interpolation methods.
In each method, an entire frame is read, and then each
pixel on the lines belonging to the inactive field is replaced
with a value computed from the pixels immediately above
and below. For methods (3) and (4), the computation also 
uses pixel values from the inactive field. 



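Let P_inactive denote an inactive line pixel to be
interpolated, and P_above and P_below denote the active line
pixels above and below P_inactive, respectively. The four
methods are:

(1) Average:
$$P_{\mathrm{inactive}} = \frac{P_{\mathrm{above}} + P_{\mathrm{below}}}{2}$$

(2) Duplicate:
$$P_{\mathrm{inactive}} = \begin{cases} P_{\mathrm{above}} & \text{if first line active} \\ P_{\mathrm{below}} & \text{otherwise} \end{cases}$$

(3) Hybrid average:
$$P_{\mathrm{inactive}} = \frac{P_{\mathrm{inactive}}}{2} + \frac{P_{\mathrm{above}} + P_{\mathrm{below}}}{4}$$

(4) Median:
$$P_{\mathrm{inactive}} = \mathrm{median}(P_{\mathrm{inactive}}, P_{\mathrm{above}}, P_{\mathrm{below}})$$
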

Method (1) average is the default. 
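
The four line-interpolation methods can be sketched as below. The function name, the list-of-rows frame layout, and the edge rule (an inactive line at the top or bottom of the frame reuses its single available neighbor) are assumptions made for illustration; the frame is assumed to have at least two lines.

```python
from statistics import median


def interpolate_inactive(frame, first_line_active=True, method="average"):
    """Fill the inactive lines of a full-height interlaced frame.

    frame is a list of rows (lists of pixel values) with at least two rows.
    Active and inactive lines alternate, starting with an active line when
    first_line_active is True. Returns a new frame; the input is unchanged.
    """
    if method not in ("average", "duplicate", "hybrid", "median"):
        raise ValueError("unknown interpolation method: %r" % method)
    height = len(frame)
    start = 1 if first_line_active else 0  # index of the first inactive row
    out = [list(row) for row in frame]
    for r in range(start, height, 2):
        # Assumed edge rule: reuse the only available neighboring active line.
        above = frame[r - 1] if r - 1 >= 0 else frame[r + 1]
        below = frame[r + 1] if r + 1 < height else frame[r - 1]
        for c, old in enumerate(frame[r]):
            a, b = above[c], below[c]
            if method == "average":
                out[r][c] = (a + b) / 2
            elif method == "duplicate":
                out[r][c] = a if first_line_active else b
            elif method == "hybrid":
                out[r][c] = old / 2 + (a + b) / 4
            else:  # median of the inactive pixel and its two active neighbors
                out[r][c] = median([old, a, b])
    return out


# Hypothetical 4-line frame: rows 0 and 2 are active, rows 1 and 3 inactive.
example = [[10, 20], [0, 100], [30, 40], [0, 0]]
avg = interpolate_inactive(example, method="average")
dup = interpolate_inactive(example, method="duplicate")
hyb = interpolate_inactive(example, method="hybrid")
med = interpolate_inactive(example, method="median")
```

Methods (1) and (2) discard the inactive-field values entirely, whereas methods (3) and (4) blend them with the active neighbors.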

Returning to FIG. 3, following the nonlinearity process, 
the third processing stage 330 models vertical electron-beam 
spot spread into interline locations by replacing the interline 
values in fields R, G, B by interpolated values from above 
and below. Then, the vector (R,G,B) at each pixel in the field 
is subjected to a linear transformation (which depends on the 
display phosphors) to CIE 1931 tristimulus coordinates (X,
Y, Z). The luminance component Y of this vector is passed 
to luminance processing section 220 as discussed above. 

More specifically, the CIE 1931 tristimulus values X, Y,
and Z are computed for each pixel, given the fractional
luminance values R, G, B. This process requires the following
inputs which are display device dependent: the chromaticity
coordinates (x_r, y_r), (x_g, y_g), (x_b, y_b) of the three
phosphors, and the chromaticity of the monitor white point
(x_w, y_w).

The white point is selected as corresponding to Illuminant
D65, such that (x_w, y_w) = (0.3128, 0.3292) (see G. Wyszecki
and W. S. Stiles, Color Science, 2nd ed., Wiley, 1982, p.
761). The values (x_r, y_r) = (0.6245, 0.3581), (x_g, y_g) = (0.2032,
0.716), and (x_b, y_b) = (0.1465, 0.0549) for the red, green and
blue phosphors, respectively, correspond to currently available
phosphors that closely approximate NTSC phosphors.
However, Table 1 below illustrates other display phosphor
coordinate (phosphor primary chromaticity) options. ITU-R
BT.709 (Rec. 709) is the default.



Source                       (x_r, y_r)        (x_g, y_g)        (x_b, y_b)

ITU-R BT.709 (SMPTE 274M)    (0.640, 0.330)    (0.300, 0.600)    (0.150, 0.060)
SMPTE 240M                   (0.630, 0.340)    (0.310, 0.595)    (0.155, 0.070)
EBU                          (0.640, 0.330)    (0.290, 0.600)    (0.150, 0.060)


Using the above parameter values, the values X, Y, Z of
the pixel are given by the following equations:

where $z_r = 1 - x_r - y_r$, $z_g = 1 - x_g - y_g$, $z_b = 1 - x_b - y_b$, and the
values $Y_{0r}$, $Y_{0g}$, $Y_{0b}$ are given by

$$\begin{bmatrix} Y_{0r} \\ Y_{0g} \\ Y_{0b} \end{bmatrix} = \begin{bmatrix} x_r/y_r & x_g/y_g & x_b/y_b \\ 1 & 1 & 1 \\ z_r/y_r & z_g/y_g & z_b/y_b \end{bmatrix}^{-1} \begin{bmatrix} x_w/y_w \\ 1 \\ z_w/y_w \end{bmatrix} \qquad (6)$$

where $z_w = 1 - x_w - y_w$ (see D. Post, Colorimetric
measurement, calibration, and characterization of self-luminous
displays, in Color in Electronic Displays, H.
Widdel and D. L. Post (eds), Plenum Press, 1992, p. 306).

The tristimulus values $X_n$, $Y_n$, $Z_n$ of the white point of the
device are also needed. These values correspond to the
chromaticity $(x_w, y_w)$ and are such that, at full phosphor
activation (R'=G'=B'=255), Y=1. Thus, the tristimulus values
for the white point are $(X_n, Y_n, Z_n) = (x_w/y_w, 1, z_w/y_w)$.
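As an illustration of Equation (6), the sketch below solves for the phosphor luminances $Y_{0r}$, $Y_{0g}$, $Y_{0b}$ from the chromaticities and then maps fractional R, G, B values to X, Y, Z. The helper names (`det3`, `solve3`, `phosphor_luminances`, `rgb_to_xyz`) are hypothetical. The final mapping assumes the standard colorimetric relation $X = \sum_c (x_c/y_c)\,Y_{0c}\,C$, $Y = \sum_c Y_{0c}\,C$, $Z = \sum_c (z_c/y_c)\,Y_{0c}\,C$, because the equation preceding (6) is not legible in this text.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, v):
    """Solve m @ y = v for y using Cramer's rule."""
    d = det3(m)
    out = []
    for col in range(3):
        # Replace one column of m by v and take the determinant ratio.
        mc = [[v[r] if c == col else m[r][c] for c in range(3)]
              for r in range(3)]
        out.append(det3(mc) / d)
    return out


def phosphor_luminances(red, green, blue, white):
    """Return (M, Y0): the chromaticity matrix of Eq. (6) and its solution."""
    cols = [(x / y, 1.0, (1 - x - y) / y) for x, y in (red, green, blue)]
    m = [[cols[c][r] for c in range(3)] for r in range(3)]
    xw, yw = white
    target = [xw / yw, 1.0, (1 - xw - yw) / yw]
    return m, solve3(m, target)


def rgb_to_xyz(rgb, m, y0):
    """Map fractional (R, G, B) to (X, Y, Z); assumed standard relation."""
    return [sum(m[r][c] * y0[c] * rgb[c] for c in range(3)) for r in range(3)]


# ITU-R BT.709 primaries from Table 1 with the D65 white point quoted above.
M, Y0 = phosphor_luminances((0.640, 0.330), (0.300, 0.600),
                            (0.150, 0.060), (0.3128, 0.3292))
white_xyz = rgb_to_xyz([1.0, 1.0, 1.0], M, Y0)
```

By construction the three phosphor luminances sum to 1, so full activation yields Y = 1 and reproduces the white-point tristimulus values $(x_w/y_w, 1, z_w/y_w)$.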

As an optional final stage in deriving the values X, Y, Z, 
an adjustment can be made to accommodate an assumed 
ambient light due to veiling reflection from the display 
screen. This adjustment takes the form: 



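$$X \leftarrow X + \frac{L_a}{L_{max}} X_n, \quad Y \leftarrow Y + \frac{L_a}{L_{max}} Y_n, \quad Z \leftarrow Z + \frac{L_a}{L_{max}} Z_n \qquad (6a)$$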



Here, two user-specifiable parameters, $L_{max}$ and $L_a$, are
introduced and assigned default values. $L_{max}$, the maximum
luminance of the display, is set to 100 cd/m² to correspond
to commercial displays. The veiling luminance, $L_a$, is set to
5 cd/m², consistent with measured screen values under Rec.
500 conditions.
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The chromaticity of the ambient light is assumed to be the 
same as that of the display white point. It should be noted 
that in the luminance-only implementation option, which 
does not compute the neutral point $(X_n, Y_n, Z_n)$, the adjustment:

$$Y \leftarrow Y + \frac{L_a}{L_{max}} \qquad (6b)$$

is made instead of Equation (6a). This is equivalent to the Y
component of Equation (6a) because $Y_n$ is always 1. It
should also be noted that the quantity $L_{max} \cdot Y$ is the luminance
of the display in cd/m².

Returning to FIG. 3, to ensure (at each pixel) approximate 
perceptual uniformity of the color space to isoluminant color 
differences, the individual pixels are mapped into CIELUV, 
an international-standard uniform-color space (see 
Wyszecki and Stiles, 1982) in the fourth processing stage 
340. The chrominance components u*, v* of this space are 
passed to the chrominance processing section 230. 

More specifically, the X, Y, Z values are transformed,
pixel by pixel, to the 1976 CIELUV uniform-color system
(Wyszecki and Stiles, 1982, p. 165):
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It should be noted that the coordinate L* is not passed to 
the luminance processing section 220. L* is used only in 
computing the chrominance coordinates u* and v*. 
Consequently, only u* and v* images are saved for further 
processing. 

FIG. 4 illustrates a block diagram of the luminance 
processing section 220. FIG. 4 can be perceived as a 
flowchart of luminance processing steps or as a block 
diagram of a plurality of hardware components for perform- 
ing such luminance processing steps, e.g., filters, various 
circuit components and/or application specific integrated 
circuits (ASIC). 

Referring to FIG. 4, each luminance field is filtered and 
down-sampled in a four-level Gaussian pyramid 410, in 
order to model the psychophysically and physiologically 
observed decomposition of incoming visual signals into 
different spatial-frequency bands 412-418. After the
decomposition, subsequent optional processing can be 
performed, e.g., oriented filtering, applied at each pyramid 
level. 

Next, a non-linear operation 430 is performed immediately
following the pyramid decomposition. This stage is a
gain-setting operation (normalization) based on a time-dependent
windowed average (across fields) of the maximum
luminance within the coarsest pyramid level. This
normalization sets the overall gain of the perceptual metric
generator and models effects such as the loss of visual
sensitivity after a transition from a bright to a dark scene.

It should be noted that an intermediate normalization
process 420 is performed to derive an intermediate value
$I_{norm}$. The $I_{norm}$ value is employed to scale each of the four
pyramid levels as discussed below.

After normalization, the lowest-resolution pyramid image
418 is subjected to temporal filtering and contrast computation
450, and the other three levels 412-416 are subjected
to spatial filtering and contrast computation 440. In each
case, the contrast is a local difference of pixel values divided
by a local sum, appropriately scaled. In the formulation of
the perceptual metric generator, this establishes the definition
of "1 JND", which is passed on to subsequent stages of
the perceptual metric generator. (Calibration iteratively
revises the 1-JND interpretation at intermediate perceptual
metric generator stages, which is discussed below.) In each
case, the contrast is squared to produce what is known as the
contrast energy. The algebraic sign of the contrast is preserved
for reattachment just prior to image comparison (JND
map computation).

The next stages 460 and 470 (contrast-energy masking) 
constitute a further gain-setting operation in which each 
oriented response (contrast energy) is divided by a function 
of all the contrast energies. This combined attenuation of 
each response by other local responses is included to model 
visual "masking" effects such as the decrease in sensitivity 
to distortions in "busy" image areas. At this stage of the 
perceptual metric generator, temporal structure (flicker) is 
made to mask spatial differences, and spatial structure is also 
made to mask temporal differences. Luminance masking is 
also applied on the chrominance side, as discussed below. 

The masked contrast energies (together with the contrast 
signs) are used to produce the luminance JND map 480. In 
brief, the luminance JND map is produced by: 1) separating 
each image into positive and negative components (half- 
wave rectification); 2) performing local pooling (averaging 
and downsampling, to model the local spatial summation 
observed in psychophysical experiments); 3) evaluating the 
absolute image differences channel by channel; 4) thresh- 
olding (coring); 5) raising the cored image differences to a 
power; and 6) up-sampling to the same resolution (which 
will be half the resolution of the original image due to the 
pooling stage). 

FIG. 19 illustrates a block diagram of an alternate 
embodiment of the luminance processing section 220. More 
specifically, the normalization stages 420 and 430 of FIG. 4 
are replaced with a luminance compression stage 1900. In 
brief, each luminance value in the input signal is first 
subjected to a compressive nonlinearity, which is described 
below in detail. Other stages in FIG. 19 are similar to those 
in FIG. 4. As such, the description of these similar stages is
provided above. For dissimilar stages, a detailed description of
the luminance processing section of FIG. 19 is provided
below with reference to FIG. 20.

In general, the luminance processing section of FIG. 19 is 
the preferred embodiment. However, since these two 
embodiments exhibit different characteristics, their perfor- 
mance may differ under different applications. For example, 
it has been observed that the luminance processing section 
of FIG. 4 performs well at higher dynamic ranges, e.g., 
10-bit input images versus a lower dynamic range. 

FIG. 5 illustrates a block diagram of the chrominance processing section 230. FIG. 5 can be perceived as a flowchart of chrominance processing steps or as a block diagram of a plurality of hardware components for performing such chrominance processing steps, e.g., filters, various circuit components and/or application specific integrated circuits (ASIC). Chrominance processing parallels luminance processing in several aspects. Intra-image differences of chrominance (u* 502 and v* 504) of the CIELUV space are used to define the detection thresholds for the chrominance operation, in analogy to the way the Michelson contrast (and Weber's law) is used to define the detection threshold in the luminance processing section. Also, in analogy with the luminance operation, the chromatic "contrasts" defined by u* and v* differences are subjected to a masking operation. A transducer nonlinearity makes the discrimination of a contrast increment between one image and another depend on the contrast energy that is common to both images.

More specifically, FIG. 5 shows, as in the luminance processing section, each chrominance component u* 502, v* 504 being subjected to a pyramid decomposition process 510. However, whereas luminance processing implements a four pyramid level decomposition in the preferred embodiment, chrominance processing is implemented with seven (7) levels. This implementation addresses the empirical fact that chromatic channels are sensitive to far lower spatial frequencies than luminance channels (K. T. Mullen, "The contrast sensitivity of human colour vision to red-green and blue-yellow chromatic gratings," J. Physiol. 359, 381-400, 1985). Furthermore, such decomposition takes into account the intuitive fact that color differences can be observed in large, uniform regions.

Next, to reflect the inherent insensitivity of the chrominance channels to flicker, temporal processing 520 is accomplished by averaging over four image fields.

Then, spatial filtering by a Laplacian kernel 530 is performed on u* and v*. This operation produces a color difference in u*, v*, which (by definition of the uniform color space) is metrically connected to just-noticeable color differences. A value of 1 at this stage is taken to mean a single JND has been achieved, in analogy to the role of Weber's-law-based contrast in the luminance channel. (As in the case of luminance processing, the 1-JND chrominance unit must undergo reinterpretation during calibration.)

This color difference value is weighted, squared and passed (with the contrast algebraic sign) to the contrast-energy-masking stage 540. The masking stage performs the same function as in the luminance processing section. The operation is somewhat simpler, since it receives input only from the luminance channels and from the chrominance channel whose difference is being evaluated. Finally, the masked contrast energies are processed exactly as in the luminance processing section to produce a chrominance JND map in stage 550.

For each field in the video-sequence comparison, the luminance and chrominance JND maps are first reduced to single-number summaries, namely luminance and chrominance JND values. In each case, the reduction from map to number is done by summing all pixel values through a Minkowski addition. The luminance and chrominance JND numbers are then combined, again via a Minkowski addition, to produce the JND estimate for the field being processed by the perceptual metric generating section 260. A single performance measure 270 for many fields of a video sequence is determined by adding, in the Minkowski sense, the JND estimates for each field.

FIG. 6 illustrates a detailed block diagram of the luminance processing section 220 of FIG. 4. Input test and reference field images are denoted by $I_k$ and $I^{ref}_k$ respectively (k=0, 1, 2, 3). Pixel values in $I_k$ and $I^{ref}_k$ are denoted by $I_k(i,j)$ and $I^{ref}_k(i,j)$, respectively. These values are the Y tristimulus values 605 computed in input signal processing section 210. Only the fields $I_k$ are discussed below, since $I^{ref}_k$ processing is identical. It should be noted that k=3 denotes the most recent field in a 4-field sequence.

Spatial decomposition at four resolution levels is accomplished through a computationally efficient method called pyramid processing or pyramid decomposition, which smears and downsamples the image by a factor of 2 at each successively coarser level of resolution. The original, full-resolution image is called the zeroth level (level 0) of the pyramid, $G_0 = I_3(i,j)$. Subsequent levels, at lower resolutions, are obtained by an operation called REDUCE. Namely, a three-tap low-pass filter 610 with weights (1,2,1)/4 is applied to $G_0$ sequentially in each direction of the image to generate a blurred image. The resulting image is then subsampled by a factor of 2 (every other pixel is removed) to create the next level, $G_1$.

Denoting fds1( ) as the operation of filtering and downsampling by one pyramid level, the REDUCE process can be represented as

$$G_{i+1} = \mathrm{fds1}(G_i), \quad \text{for } i = 1, 2, 3. \qquad (13a)$$

The REDUCE process is applied recursively to each new level (as described by P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Transactions on Communications, COM-31, 532-540 (1983)).

Conversely, an operation EXPAND is defined that upsamples and filters by the same 3x3 kernel. This operation is denoted by usf1( ), and appears below.

The fds1 and usf1 filter kernels in each direction (horizontal and vertical) are $k_d[1,2,1]$ and $k_u[1,2,1]$, respectively, where constants $k_d$ and $k_u$ are chosen so that uniform-field values are conserved. For fds1 the constant is $k_d = 0.25$, and for usf1, the constant is $k_u = 0.5$ (because of the zeros in the upsampled image). To implement usf1 as an in-place operation, the kernel is replaced by the equivalent linear interpolation to replace the zero values. However, for conceptual simplicity, it can be referred to as an upsample-filter.

Next, normalization is applied, where an intermediate value (denoted by $I_{lvl3}$) is computed by averaging four values, the maximum pixel values in the Level 3 images for each field (k=0,1,2,3). This step mitigates the effect of outliers in the full resolution (Level 0) image by the smoothing inherent in the pyramid decomposition process. $I_{lvl3}$ is then compared with a decremented value of the normalization factor, $I_{norm}^{(2)}$, used in the previous epoch (k=2). $I_{norm}$ for the current epoch (k=3) is set equal to the larger of these two values. Images for all 4 pyramid levels for the latest field are then scaled by using this new value of $I_{norm}$, and subjected to a saturating nonlinearity.

The following equations describe this process. If the pyramid levels from above are $I_{3,l}(i,j)$, where 3 and l denote the latest field and pyramid level, respectively, then

[equation illegible in source] (620) (14)

where $I_{norm} = \max[\alpha I_{norm}^{(2)}, I_{lvl3}]$ 615, $I_{norm}^{(2)}$ is the value of $I_{norm}$ used in the previous epoch to normalize the field-3 pyramid levels, m defaults to 2, and

$$\alpha = 2^{-\Delta t / t_{half}}. \qquad (15)$$

$\Delta t$ is the reciprocal of the field frequency, and $t_{half} = 1/2$ is related to the adaptation rate of the human visual system following removal of a bright stimulus. Values of $\alpha$ for 50 and 60 Hz, respectively, are 0.9727 and 0.9772. The constant $L_D$ represents a residual visual response (noise) that exists in the absence of light, and defaults to a value of 0.01. The saturating nonlinearity in Equation (14) is derived from physiologically based models (see Shapley and Enroth-Cugell, 1984).

Oriented spatial filters (center and surround) are applied to the level 0, 1, and 2 images for field 3. In contrast, a temporal filter is applied to the lowest resolution level (level 3). Namely, the first and last pairs of fields are combined linearly into Early and Late images, respectively.

The center and surround filters 625 and 627 are separable 3x3 filters and yield all combinations of orientation: Center Vertical (CV), Center Horizontal (CH), Surround Vertical (SV), and Surround Horizontal (SH). The filter kernels are as follows:

[equation illegible in source] (16)

The level 3 Early 630 and Late 632 images are, respectively,

$$E_3 = t_e I_{3,1}(i,j) + (1 - t_e) I_{3,0}(i,j), \qquad (17)$$

[equation illegible in source] (18)

The constants $t_e$ and $t_l$ for 60 Hz are 0.5161 and 0.4848, respectively, and for 50 Hz are 0.70 and 0.30, respectively.

Inputs for the contrast computation are the center and surround images $CV_i$, $CH_i$, $SV_i$, and $SH_i$ (i=0,1,2 for pyramid levels 0, 1, and 2), and the Early and Late images $E_3$ and $L_3$ (for pyramid level 3). The equation used to compute the contrast ratio is analogous to the Michelson contrast. For the horizontal and vertical orientations, the respective contrasts, pixel-by-pixel, are:

$$\frac{SH_i - CH_i}{w_i (CH_i + SH_i)} \quad \text{and} \quad \frac{SV_i - CV_i}{w_i (CV_i + SV_i)}. \qquad (19)$$

Similarly, the contrast ratio for the temporal component is:

$$\frac{E_3 - L_3}{w_3 (E_3 + L_3)}. \qquad (20)$$

The values of $w_i$ for i=0,1,2,3, as determined by calibration with psychophysical test data, are 0.015, 0.0022, 0.0015, and 0.003, respectively.

Horizontal and vertical contrast-energy images 640 and 642 are computed by squaring the pixel values defined by the two preceding equations, thus obtaining:

$$H_i = \left( \frac{SH_i - CH_i}{w_i (CH_i + SH_i)} \right)^2 \quad \text{and} \quad V_i = \left( \frac{SV_i - CV_i}{w_i (CV_i + SV_i)} \right)^2, \quad i = 0, 1, 2. \qquad (21)$$

Similarly, the temporal contrast-energy image 650 is computed by squaring the pixel values:

$$T_3 = \left( \frac{E_3 - L_3}{w_3 (E_3 + L_3)} \right)^2. \qquad (22)$$

The algebraic sign of each contrast ratio pixel value prior to squaring is retained for later use.

Contrast-energy masking is a nonlinear function applied to each of the contrast energies or images computed with equations 21 and 22. The masking operation models the effect of spatiotemporal structure in the reference image sequence on the discrimination of distortion in the test image sequence.

For example, a test and a reference image differ by a low-amplitude spatial sine wave. It is known (Nachmias and Sansbury, 1974) that this difference is more visible when both images have in common a mid-contrast sine wave of the same spatial frequency, than if both images contain a uniform field. However, if the contrast of the common sine wave is too great, the image difference becomes less visible. It is also the case that sine waves of other spatial frequencies can have an effect on the visibility of the contrast difference. This behavior can be modeled by a nonlinearity that is sigmoid at low contrast energies, and an increasing power function for high contrast energies. Furthermore, the following criteria can be observed approximately in human vision. Each channel masks itself, high spatial frequencies mask low ones (but not the reverse), and temporal flicker masks spatial contrast sensitivity (and also the reverse). The foregoing spatial filtering can be enhanced to respond in a visually faithful way to point or line flicker by processing information from multiple image fields (e.g., two image fields), without disturbing the response to pure-spatial or pure-temporal images.

Generalizing Eq. 19, the invention defines pyramids CH2, SH2, CV2, and SV2 as the result of applying the kernels CH, SH, CV, and SV (respectively) defined by Eq. 16 to the image pyramids for field 2, and pyramids CH3, SH3, CV3, and SV3 as the result of applying the kernels CH, SH, CV, and SV (respectively) defined by Eq. 16 to the image pyramids for field 3.

As depicted in the block diagram of FIG. 28, the invention applies all these operators to, for example, the last two fields (stored as pyramids) of the four-field image sequence. Prior to application of the operators, the image sequence is downsampled using downsampler 2802. The pyramid levels are then processed in an image field processor 2804 such that the field 2 and field 3 information is separately produced. Each field is respectively filtered by field 2 filters 2806 and field 3 filters 2808, i.e., the foregoing operators are applied using filters 625 and 627 as described with respect to FIG. 6.

More specifically, in a manner exactly the same as Eq. 19, for pyramid level i=0, 1, 2, the alternative embodiment 2800 uses a contrast computer 2810 to perform a contrast comparison using information from the two fields by defining oriented contrasts



$$\frac{SH3_i - CH3_i - SH2_i + CH2_i}{wST_i\,(SH3_i + CH3_i + SH2_i + CH2_i)} \quad \text{and} \quad \frac{SV3_i - CV3_i - SV2_i + CV2_i}{wST_i\,(SV3_i + CV3_i + SV2_i + CV2_i)}.$$

The contrast information is further processed by a nonlinear processor 2812. The processor 2812 processes both the output of the temporal filters 2814 (which operate as discussed with respect to FIG. 6) and the contrast information to produce information that is used to generate a luminance JND map 2816. The process used to produce the JND map is described below.

The multiple field process is calibrated on the data of Koenderink and van Doorn (1979) to find the new coefficients $wST_i$. Note that all these contrasts are zero for any stimulus that has either pure spatial or pure temporal variations.

Lastly, the invention incorporates the same formalism for masking as is already used on the other spatial channels.

In response to these properties of human vision, the following form of the nonlinearity (applied pixel by pixel) 660 is applied:

$$T(y, D_i) = \frac{d_y\, y^{\beta}}{[\text{denominator illegible in source}]} \qquad (23)$$

Here, y is the contrast energy to be masked: spatial, $H_i$ or $V_i$ (Equation (21)), or temporal, $T_3$ (Equation (22)). The quantity $D_i$ refers (pixel by pixel) to an image that depends on the pyramid level i to which y belongs. Quantities $\beta$, $\sigma$, a, and c were found by perceptual metric generator calibration to be 1.17, 1.00, 0.0757, and 0.4753, respectively, and $d_y$ is the algebraic sign that was inherent in contrast y before it was squared.

Computation of $D_i$ requires both pyramid construction (filtering and downsampling) and pyramid reconstruction (filtering and upsampling). This computation of $D_i$ is illustrated in FIG. 6. First, $E_0$ is computed as the sum of $H_0$ and $V_0$. This sum is filtered, downsampled by stage 652, and added to $H_1 + V_1$ to give $E_1$. Next, $E_1$ is further filtered, downsampled, and added to $H_2 + V_2$ to give $E_2$. In turn, $E_2$ is further filtered and downsampled to give $E_3$. Meanwhile, the image of temporal contrasts $T_3$ is multiplied by $m_t$ and added to $m_{ft} E_3$ to produce a sum which is denoted $D_3$.

In turn, $D_3$ is upsampled and filtered by stage 654 repeatedly to produce $T_2$, $T_1$, and $T_0$. Finally, the images $D_i$ are defined as $D_i = m_f E_i + T_i$, i=0,1,2. Here, $m_f$ is determined by calibration to be equal to 0.001, $m_{ft}$ is set equal to 0.0005, and $m_t$ is set equal to 0.05. The filtering, downsampling and upsampling steps are identical to those previously discussed.

The above processing illustrates that the higher spatial frequencies mask the lower ones (since $D_i$ are influenced by pyramid levels less than or equal to i), and the temporal channel is masked by all the spatial channels. This masking operation is generally in accord with psychophysical observation. The quantities $D_i$, i=0,1,2, also mask chrominance contrasts (but not the reverse) as discussed below.

FIG. 20 illustrates a detailed block diagram of the alternate embodiment of the luminance processing section 220 of FIG. 19. Since the luminance processing section of FIG. 19 contains many similar stages to that of the luminance processing section of FIG. 6, a description is provided below only for the dissimilar stages.

One significant difference is the replacement of the normalization stages of FIG. 6 with a luminance compression (compressive nonlinearity) stage 2000 in FIG. 20. Namely, the nonlinearity comprises a decelerating power function offset by a constant. Let the relative-luminance array from the latest field be $Y_3(i,j)$, where 3 denotes the latest field. Then:

$$I_3(i,j) = [L_{max}\, Y_3(i,j)]^m + L_D^m. \qquad (23a)$$



$L_{max}$, the maximum luminance of the display, is set to 100
cd/m². The present function is calibrated with the contrast-sensitivity
data at 8 c/deg. Thus, the adjustable parameters,
m and $L_D$, are found to be 0.65 and 7.5 cd/m², respectively.
Namely, the values of $L_D$ and m were chosen so as to match
contrast detection data at luminance levels from 0.01 to 100
ft-L (van Nes and Bouman, 1967). In other words, equation
(23a) allows one to calibrate against an absolute luminance,
e.g., changing the maximum luminance of the display will
affect the total luminance output. Another way to view
equation (23a) is that it allows the perceptual metric generator
to incorporate a luminance-dependent contrast-sensitivity
function.
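A minimal sketch of the luminance compression stage follows. It assumes the form $I = (L_{max} Y + L_a)^m + L_D^m$, i.e. the per-level expression of equation (23b), with $L_a = 0$ recovering what equation (23a) appears to state; the function name `compress_luminance` and its keyword arguments are hypothetical, and the defaults are the 8 c/deg calibration values quoted in the text.

```python
def compress_luminance(y, l_max=100.0, m=0.65, l_d=7.5, l_a=0.0):
    """Compressive nonlinearity applied to a relative luminance y in [0, 1].

    l_max: maximum display luminance in cd/m^2
    m:     exponent of the decelerating power function
    l_d:   constant offset parameter (cd/m^2)
    l_a:   ambient (veiling) luminance offset; 0 gives the assumed (23a) form
    """
    return (l_max * y + l_a) ** m + l_d ** m


# The response grows with luminance, but ever more slowly (m < 1).
dark = compress_luminance(0.0)
mid = compress_luminance(0.5)
bright = compress_luminance(1.0)
```

Because the exponent is below 1, equal luminance steps produce smaller response steps at high luminance, which is the "decelerating" behaviour the text describes.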

Alternatively, additional luminance compression
stages 2000 (shown in dashed boxes in FIG. 20) can be
inserted at each pyramid level to allow the present perceptual
metric generator to model the contrast sensitivity as a
function of both luminance and spatial frequency.
Otherwise, implementing one luminance compression stage
2000 with only two parameters will be insufficient to model
other spatial frequencies.

More specifically, after pyramid decomposition of each
luminance image, a nonlinearity is applied to each pyramid
level k. Then, for pyramid level k, the compression nonlinearity
is given by

$$I_3(i,j;k) = [L_{max}\, Y_3(i,j;k) + L_a]^{m(k)} + L_D(k)^{m(k)}, \qquad (23b)$$

where again m(k) and $L_D(k)$ are chosen so as to match
contrast detection at luminance levels from 0.01 to 100 ft-L
(van Nes et al., 1967). The value $L_a$ is an offset for ambient
screen illumination (set to 5 cd/m² based on screen
measurements), and $L_{max}$ is the maximum luminance of the
display (which generally is about 100 cd/m²).
The data to calibrate equation (23b) are tabulated below:

I_0 (td)     f_s (c/deg)    C_m

8500.00      0.500000       1.46780E-02
850.000      0.500000       1.46780E-02
85.0000      0.500000       1.46780E-02
8.50000      0.500000       1.46780E-02
0.85000      0.500000       1.46780E-02
0.08500      0.500000       1.67028E-02
8500.00      4.00000        2.61016E-03
850.000      4.00000        2.61016E-03
85.0000      4.00000        2.61016E-03
8.50000      4.00000        4.15551E-03
0.85000      4.00000        1.31409E-02
0.08500      4.00000        4.15551E-02
8500.00      8.00000        2.61016E-03
850.000      8.00000        2.61016E-03
85.0000      8.00000        2.61016E-03
8.50000      8.00000        6.71363E-03
0.85000      8.00000        2.12304E-02
0.08500      8.00000        6.71363E-02
8500.00      16.0000        3.83119E-03
850.000      16.0000        3.83119E-03
85.0000      16.0000        4.57394E-03
8.50000      16.0000        1.44641E-02
0.85000      16.0000        4.57394E-02
0.08500      16.0000        0.144641
8500.00      24.0000        6.81292E-03
850.000      24.0000        6.81292E-03
85.0000      24.0000        1.44641E-02
8.50000      24.0000        4.57394E-02
0.85000      24.0000        0.144641
0.08500      24.0000        0.457394
8500.00      32.0000        1.21153E-02
850.000      32.0000        1.21153E-02
85.0000      32.0000        2.97023E-02
8.50000      32.0000        9.39270E-02
0.85000      32.0000        0.297023
0.08500      32.0000        0.939270
8500.00      40.0000        3.16228E-02
850.000      40.0000        3.16228E-02
85.0000      40.0000        8.95277E-02
8.50000      40.0000        0.283111
0.85000      40.0000        0.89527
8500.00      48.0000        7.49894E-02
850.000      48.0000        8.13375E-02
85.0000      48.0000        0.257212
8.50000      48.0000        0.813374

Each contrast modulation $C_m$ in the above table is the
experimental value that resulted in just-discriminable contrast
of the sine wave of spatial frequency $f_s$ and retinal
illuminance $I_0$. It should be noted that since a 2-mm artificial
pupil is used in the calibration, the retinal illuminance values
($I_0$ in trolands) are multiplied by π to retrieve the luminance
values (L in cd/m²). A good starting point for calibration is
to use for all the m(k) and $L_D(k)$ the default values for 8
c/deg sine-wave detection, for which the proper exponent m
is 0.65, and the proper value of $L_D$ is 7.5 cd/m².

The luminance spatial and temporal filtering are identical 
for both perceptual metric generators of FIG. 6 and FIG. 20. 
However, luminance contrast computation for the perceptual 
metric generator of FIG. 20 is achieved without the square 
operation. The stages 640, 642 and 650 are replaced by 
stages 2040, 2042 and 2050 in FIG. 20. 

More specifically, contrast-response images are computed 
as clipped versions of the absolute values of the quantities 
defined by the above equations (19) and (20). These quan- 
tities are computed as: 



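$$H_i = \max\left\{0,\ \left| \frac{SH_i - CH_i}{w_i (CH_i + SH_i)} \right| - e \right\}, \quad V_i = \max\left\{0,\ \left| \frac{SV_i - CV_i}{w_i (CV_i + SV_i)} \right| - e \right\}, \quad i = 0, 1, 2, \text{ and} \qquad (23c)$$

$$T_3 = \max\left\{0,\ \left| \frac{E_3 - L_3}{w_3 (E_3 + L_3)} \right| - e \right\}, \quad \text{where } e = 0.75. \qquad (23d)$$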
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The algebraic sign of each contrast ratio pixel value prior 
to the absolute-value operation must also be retained for 
later use. 
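The clipped contrast response and the retained sign can be sketched per pixel as below. The function name `clipped_contrast` is hypothetical, the expression is the clipped absolute value of the contrast ratios of equations (19) and (20), and the sketch assumes the sum of the two inputs is nonzero (which holds when both are positive filter outputs).

```python
def clipped_contrast(first, second, w, e=0.75):
    """Return (magnitude, sign) of a clipped contrast response.

    For spatial channels, first is the surround image value and second the
    center value; for the temporal channel, first is Early and second Late.
    The ratio (first - second) / (w * (first + second)) is taken in absolute
    value, reduced by e, and clipped at zero. The sign of the ratio is
    returned separately so it can be reattached later.
    """
    ratio = (first - second) / (w * (first + second))
    sign = 1 if ratio >= 0 else -1
    return max(0.0, abs(ratio) - e), sign


# A 3% surround/center difference at pyramid level 0 (w = 0.015).
magnitude, sign = clipped_contrast(1.03, 1.0, 0.015)
```

Contrast ratios whose absolute value falls below e are clipped to zero, so the threshold e acts as a dead zone before any response is produced.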

Another significant difference between the perceptual metric generators of FIG. 6 and FIG. 20 is the implementation of the contrast energy masking. Unlike FIG. 6, the perceptual metric generator of FIG. 20 implements contrast energy masking 2060 in two separate stages: a cross masking stage and a self masking stage for each of the horizontal and vertical channels (See FIG. 20). Self masking reduces sensitivity in the presence of information within a current channel, whereas cross masking reduces sensitivity in the presence of information in a neighboring channel. In fact, the order of these two separate masking stages can be inverted. These contrast energy masking stages have the following forms:

$$T(y, D_i) = [\text{expression illegible in source}] \quad \text{(self masking)} \qquad (23e)$$

where,

$$\frac{y}{1 + m_f (D_i - y)} \quad \text{for } i = 0, 1, 2, \text{ and} \quad \frac{y}{1 + D_3 - m_t\, y} \quad \text{(cross masking)}.$$

Here, y is the contrast to be masked: spatial, $H_i$ or $V_i$ (Equation (23c)), or temporal, $T_3$ (Equation (23d)). The quantity $D_i$ refers (pixel by pixel) to an image that depends on the pyramid level i to which y belongs. Quantities b, a, c, $m_f$ and $m_t$ were found by model calibration to be 1.4, 3/32, 5/32, 10/1024, and 50, respectively. $d_y$ is the algebraic sign of contrast y that is saved before taking the absolute value.

Computation of $D_i$ is similar to that of FIG. 6 as discussed
above. Namely, fds1( ) denotes a 3x3 filtering followed by
downsampling by one pyramid level, and usf1( ) denotes
upsampling by one pyramid level followed by a 3x3 filtering.
First, array $E_0$ is computed as:

$$E_0 = H_0 + V_0 \qquad (23f)$$

Then, for i=1, 2, the arrays $E_i$ are computed recursively:

$$E_i = H_i + V_i + \mathrm{fds1}(E_{i-1}), \quad \text{for } i = 1, 2. \qquad (23g)$$

$$E_3 = \mathrm{fds1}(E_2) \qquad (23h)$$

The arrays $E_i$ are then combined with the temporal
contrast image $T_3$ and images $T_i$ to give the contrast denominator
arrays $D_i$, as follows:

$$D_3 = m_t T_3 + m_f\, \mathrm{fds1}(E_2), \qquad (23i)$$

$$T_2 = \mathrm{usf1}(D_3), \quad T_i = \mathrm{usf1}(T_{i+1}) \ \text{for } i = 1, 0, \text{ and} \quad D_i = E_i + T_i, \ \text{for } i = 0, 1, 2. \qquad (23j)$$



Here, parameter $m_f = 3/64$ modulates the strength with
which the temporal (flicker) luminance-channel is masked
by all the spatial-luminance channels together; and parameter
$m_t = 50$ modulates the strength with which each of the
spatial-luminance channels is masked by the temporal
(flicker) luminance-channel.
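The recursion for the contrast denominator arrays can be sketched on small nested-list images as follows. This is an illustration under stated assumptions, not the patented implementation: all function names are hypothetical, image sides are assumed to halve exactly at each level, edges are handled by replication (the text does not specify edge handling), and usf1 is written as the equivalent linear interpolation rather than zero insertion followed by filtering.

```python
def blur121(row):
    """Apply the (1,2,1)/4 low-pass filter to a 1-D list, replicating edges."""
    n = len(row)
    return [(row[max(i - 1, 0)] + 2 * row[i] + row[min(i + 1, n - 1)]) / 4
            for i in range(n)]


def transpose(img):
    return [list(col) for col in zip(*img)]


def fds1(img):
    """Filter in both directions, then keep every other pixel."""
    rows = [blur121(r) for r in img]
    blurred = transpose([blur121(c) for c in transpose(rows)])
    return [r[::2] for r in blurred[::2]]


def up1d(row):
    """Double a 1-D list by linear interpolation (last sample replicated)."""
    out = []
    for i, v in enumerate(row):
        nxt = row[min(i + 1, len(row) - 1)]
        out += [v, (v + nxt) / 2]
    return out


def usf1(img):
    """Upsample by one pyramid level in both directions."""
    rows = [up1d(r) for r in img]
    return transpose([up1d(c) for c in transpose(rows)])


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, k):
    return [[k * x for x in r] for r in a]


def masking_denominators(H, V, T3, m_f=3 / 64, m_t=50):
    """H, V: lists of three images (levels 0-2); T3: the level-3 image.

    Returns [D0, D1, D2, D3] following the recursion in the text.
    """
    E = [add(H[0], V[0])]
    for i in (1, 2):
        E.append(add(add(H[i], V[i]), fds1(E[i - 1])))
    D3 = add(scale(T3, m_t), scale(fds1(E[2]), m_f))
    T = {2: usf1(D3)}
    for i in (1, 0):
        T[i] = usf1(T[i + 1])
    return [add(E[i], T[i]) for i in range(3)] + [D3]


# Uniform unit contrasts on an 8x8, 4x4, 2x2 pyramid with no flicker.
ones = lambda n: [[1.0] * n for _ in range(n)]
D = masking_denominators([ones(8), ones(4), ones(2)],
                         [ones(8), ones(4), ones(2)], [[0.0]])
```

Because each $E_i$ accumulates the downsampled finer levels, $D_i$ at level i depends on all levels up to i plus the upsampled temporal term, which is how higher spatial frequencies and flicker come to mask a given channel.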

FIG. 7 illustrates a detailed block diagram of the luminance metric generating section 240. Again, FIG. 7 can be perceived as a flowchart of luminance metric generating steps or as a block diagram of the luminance metric generating section having a plurality of hardware components for performing such luminance metric generating steps, e.g., filters, various circuit components and/or application specific integrated circuits (ASIC). The construction described below applies to all the masked-contrast images generated in FIG. 6 above: the images in pyramids H and V (i.e., images H_0, V_0, H_1, V_1, H_2, and V_2), the image T_3 (having resolution at level 3), and the corresponding images derived from the reference sequence (denoted with superscript ref in FIG. 6 and FIG. 7).

The first four steps in the following process apply to the above images separately. In the following discussion, X denotes any of these images derived from the test sequence, and X^ref the corresponding image derived from the reference sequence. Given this notation, the steps are as follows:

In step (or stage) 710, the image X is separated into two half-wave-rectified images, one for positive contrasts 712 and the other for negative contrasts 714. In the positive-contrast image (called X_+), the signs from the X contrast (separately stored as discussed above) are used to assign zeros to all pixels in X_+ that have negative contrasts. The opposite operation occurs in the negative-contrast image X_-.

In step (or stage) 720, for each image X_+ and X_-, a local pooling operation is performed by applying a 3x3 filter to convolve the image with a filter kernel of 0.25(1,2,1) horizontally and vertically.

Furthermore, in step 720, the resulting images are downsampled by a factor of 2 in each direction, to remove redundancy resulting from the pooling operation. The same processing as applied to X is performed for the corresponding reference image X^ref.

In step (or stage) 730, the absolute-difference images |X_+ - X_+^ref| and |X_- - X_-^ref| are computed pixel-by-pixel. The resulting images are JND maps.

In step (or stage) 740, a coring operation is performed on the JND maps. Namely, all values less than a threshold t_c are set to zero. In the preferred embodiment, t_c defaults to a value of 0.5.

In step (or stage) 750, the Q-th power of these images is determined. In the preferred embodiment, Q is a positive integer that defaults to a value of 2.

After this process has been completed for all pairs X, X^ref, summary measures are determined by repeatedly upsampling, filtering, and adding all the images up to the required level. This is accomplished as follows:

In step (or stage) 760, upsampling and filtering are applied to the level-3 images derived from T_3, T_3^ref to derive a level-2 image.

In step (or stage) 761, upsampling and filtering are applied to the sum of the level-2 image from step 760 with the level-2 images derived from H_2, H_2^ref, V_2 and V_2^ref.

In step (or stage) 762, upsampling and filtering are applied to the sum of the level-1 image from step 761 with the level-1 images derived from H_1, H_1^ref, V_1 and V_1^ref.

In step (or stage) 763, upsampling and filtering are applied to the sum of the level-0 image from step 762 with the level-0 images derived from H_0, H_0^ref, V_0 and V_0^ref. The output on path 765 from step (or stage) 763 is a luminance JND map.

It should be noted that before the final processing step 763, the resulting image is half the resolution of the original image. Similarly, it should be noted that each pyramid-level index in this processing section refers to the pyramid level from which it was originally derived, which is twice the resolution of that associated with that level after filtering/downsampling.

It should also be noted that all images generated by the above repeated upsampling, filtering, and adding process are Q-th-power-JND images. The level-0 image is used in two fashions, where it is sent directly to summary processing on path 764, or upsampled and filtered in step 763 to the original image resolution for display purposes.

FIG. 21 illustrates a detailed block diagram of an alternate embodiment of the luminance metric generating section 240. Since the luminance metric generating of FIG. 21 contains many similar stages to that of the luminance metric generating of FIG. 7, a description is provided below only for the dissimilar stages.

More specifically, the "coring" stage 740 and "raise to a Q-th power" stage 750 are replaced by a plurality of max and sum stages which maintain a running sum and a running maximum of the channel outputs. Since the process illustrated by FIG. 21 is the same as FIG. 7 up to stage 730, the process of FIG. 21 is now described starting from the point where the absolute-difference images |X_+ - X_+^ref| and |X_- - X_-^ref| have been determined.

Next, after the process has been completed for all pairs of X, X^ref, a running-sum image is initialized in stage 2140 to contain the sum of the level-3 images derived from T_3, T_3^ref. Similarly, a running-maximum image is initialized in stage 2142 as the point-by-point maximum of |T_3+ - T_3+^ref| and |T_3- - T_3-^ref|.

Next, the running-sum and running-maximum images are upsampled and filtered by stages 2140a and 2142a, respectively, to comprise two level-2 images. The running-sum image is then updated by stage 2144 by adding to it the level-2 images derived from H_2, H_2^ref, V_2 and V_2^ref. Similarly, the running-maximum image is updated by stage 2146 by comparing it with the level-2 images derived from H_2, H_2^ref, V_2 and V_2^ref.

Next, the running-sum and running-maximum images are upsampled and filtered by stages 2144a and 2146a, respectively, to comprise two level-1 images. The running-sum image is then updated by stage 2148 by adding to it the level-1 images derived from H_1, H_1^ref, V_1 and V_1^ref. Similarly, the running-maximum image is updated by stage 2150 by comparing it with the level-1 images derived from H_1, H_1^ref, V_1 and V_1^ref.

Next, the running-sum and running-maximum images are upsampled and filtered by stages 2148a and 2150a, respectively, to comprise two level-0 images. The running-sum image is then updated by stage 2152 by adding to it the level-0 images derived from H_0, H_0^ref, V_0 and V_0^ref. Similarly, the running-maximum image is updated by stage 2154 by comparing it with the level-0 images derived from H_0, H_0^ref, V_0 and V_0^ref.

Finally, a point-by-point linear combination of the running-sum and running-max images is performed in stage 2160 to produce the luminance JND map in accordance with:

$$JND_L(i,j) = k_L\,\mathrm{Running\_Max}(i,j) + (1 - k_L)\,\mathrm{Running\_Sum}(i,j)$$

where k_L = 0.783. The value for k_L is determined by approximating a Minkowski Q-norm. Given a value of Q and a number of images N to be brought together, the value k_L = [N - N^(1/Q)]/[N - 1] ensures that the approximate measure matches the Q-norm exactly when all the compared entries (at a pixel) are the same, and also when there is only one nonzero entry. In this case, N = 14 (number of channels), and Q = 2.

It should be noted that after this process, the resulting image is half the resolution of the original. Similarly, it should be noted that each pyramid-level index in this process refers to the pyramid level from which it was originally derived, which is twice the resolution of that associated with that level after filtering/downsampling.

Finally, it should be noted that all images generated by the repeated filtering/downsampling and adding/maxing process can be added with weights k_L and 1 - k_L to produce JND images. The level-0 image can be processed in two fashions, where the level-0 image is sent directly to JND summary processing via path 2161 or upsampled and filtered by stage 2170 to the original image resolution for display purposes.

In general, the luminance metric generating section of FIG. 21 is the preferred embodiment, whereas the luminance metric generating section of FIG. 7 is an alternate embodiment. One reason is that the max-sum method is computationally less expensive. Thus, if dynamic range in an integer implementation is desired, then the luminance metric generating section of FIG. 21 is preferred. Otherwise, if a floating point processor is employed, then the luminance metric generating section of FIG. 7 can also be used.

Half-Height Luminance Processing

Since storage requirement and computational cycles are important processing issues, the present invention provides an alternate embodiment of a perceptual metric generator that is capable of processing half-height images, e.g., top and bottom fields of an interlace image. This embodiment reduces the amount of storage space necessary to store full-height images and, at the same time, reduces the number of computational cycles.

If the half-height images are to be passed through directly without zero-filling to the true image height, then the above luminance processing section 220 must be modified to reflect that the inherent vertical resolution is only half the inherent horizontal resolution. FIG. 22 and FIG. 23 are block diagrams of luminance processing section and luminance metric generating section for processing half-height images.

Comparison between these diagrams (FIG. 22 and FIG. 23) and the corresponding diagrams for full-height interlace images (FIG. 20 and FIG. 21) reveals that many stages are identical. As such, the description below for FIG. 22 and FIG. 23 is limited to the differences between the two implementations.

First, the highest-resolution horizontal channel, H_0, is eliminated. Second, the highest resolution image is lowpass-filtered vertically (i.e., along columns) with a 3x1 "Kell" filter (a vertical filter) 2210 with weights (1/8, 3/4, 1/8). This filter is an anti-aliasing filter in the vertical dimension for removing effect due to the fact that the lines are sampled in half the spatial frequency. Namely, it is a lowpass filter that blurs vertically. The resulting vertically filtered image, L_0, is then horizontally filtered with a 1x3 filter 2220 (kernel 0.25[1,2,1]). The resulting image, LP_0, is a horizontally low-passed version of L_0.

Next, L_0 and LP_0 are combined to produce a bandpass (LP_0 - L_0) divided by lowpass (LP_0) oriented response analogous to the (S-C)/(S+C) responses of the other oriented channels.

In turn, image LP_0 (a half-height image of 720x240 pixels) is horizontally down-sampled in stage 2200 to a full height half-resolution image (360x240). At this point, the aspect ratio is such that processing on this image and throughout the remaining three pyramid levels can now continue as in the full-height options.

Next, down-sampling and up-sampling between the half-height images from Level 0 and the full height images of Level 1 is done with a 1x3 filtering/horizontal down-sampling by stage 2232 (labeled 1x3 filter & d.s.) and horizontal up-sampling (h.u.s.)/1x3 filtering by stage 2234, respectively. Horizontal down-sampling applies decimation by a factor of two in the horizontal dimension, i.e., throwing out every other column of the image. Horizontal up-sampling inserts a column of zeros between each two columns of the existing image. The filter kernel after upsampling is defined by 0.5[1,2,1], for the reason noted above.

FIG. 23 illustrates a luminance metric generating section for processing half-height images. First, the highest-resolution horizontal channel, H_0, is not present. For the V_0 channel, a 1x3 filtering and horizontal down-sampling stage 2300 is provided to replace the 3x3 filtering and down-sampling stage as used in other channels.

Since the H_0 channel is missing, various parameters and the "pathway" of the running-maximum and running-sum are modified. For example, the value of N that determines k is changed to 12 from 14. The same value, k = 0.783, is used for both full-height and half-height processing and is the average of the full-height and half-height constants computed from the equation given above.

Finally, as in the full-height embodiment, the luminance map for summary measures must be brought to full image resolution before it is displayed. Just prior to display, the final JND map is brought to full resolution in the horizontal direction by upsampling followed by 1x3 filtering (kernel 0.5[1,2,1]) in stage 2310. In the vertical direction, line-doubling is performed in stage 2320.

It should be noted that, since each spatial filter has both horizontal and vertical spatial dependence, there are some differences in the half-height embodiment as compared to its full-height counterpart. However, it has been observed that the half-height embodiment will only exhibit slight perturbations in the correlations with subjective ratings. Thus, the non-interlace option can be employed as a viable and time-saving alternative to the interlace option.

FIG. 8 illustrates a detailed block diagram of the chrominance processing section 230. Again, FIG. 8 can be perceived as a flowchart of chrominance processing steps or as a block diagram of the chrominance processing section having a plurality of hardware components for performing such chrominance processing steps, e.g., filters, various circuit components and/or application specific integrated circuits (ASIC). It should be noted that aside from the pyramid having levels 0, 1, 2, the chrominance processing section 230 computes pyramids with levels 0, 1, . . . , 6 for both u* 802 and v* 804.

The spatial resolution of the chrominance channels (i.e., the resolution of the highest pyramid level) is chosen to be equal to that of luminance because the resolution is driven by the inter-pixel spacing, and not by the inter-receptor spacing. The inter-receptor spacing is 0.007 degrees of visual angle, and the inter-pixel spacing is 0.03 degrees, derived from a screen with 480 pixels in its height, viewed at four times its height. On the other hand, Morgan and Aiba (1985) found that red-green vernier acuity is reduced by a factor of three at isoluminance, a factor that is to be equated with three inter-receptor spacings for other kinds of acuity. Also, the resolution of the blue-yellow chromatic channel is limited by the fact that the visual system is tritanopic (blue blind) for lights subtending less than about 2' (or 0.033 deg.) of visual angle (see Wyszecki and Stiles, 1982, p. 571). The pixel resolution of 0.03 degrees of visual angle is very close to the largest of these values, such that it is appropriate to equate the pixel resolutions of luminance and chrominance channels.

The chrominance pyramid extends to level 6. This supports evidence that observers notice differences between large, spatially uniform fields of color. This effect can be addressed by using a spatially extended JND map. Quantitative evidence for contributions to the JND by such low spatial frequencies has been presented by Mullen (1985).

Returning to FIG. 8, similar to luminance processing, spatial decomposition at seven resolution levels is accomplished through pyramid decomposition, which smears and downsamples the image by a factor of 2 at each successively coarser level of resolution. The original, full-resolution image is called the zeroth level (level 0) of the pyramid. Subsequent levels, at lower resolutions, are obtained by an operation called REDUCE. Namely, a three-tap low-pass filter 805 with weights (1,2,1)/4 is applied to level 0 sequentially in each direction of the image to generate a blurred image. The resulting image is then subsampled by a factor of 2 (every other pixel is removed) to create the next level, level 1.

In step (or stage) 810, a four-field average is performed on the u* images for each resolution level, and also on the v* images, with tap weights (0.25, 0.25, 0.25, 0.25), i.e., let:

$$u^* \leftarrow \frac{1}{4}\sum_{j=0}^{3} u_j^*, \qquad v^* \leftarrow \frac{1}{4}\sum_{j=0}^{3} v_j^* \tag{23l}$$

where j is the field index. This averaging operation reflects the inherent low-pass temporal filtering of the color channels, and replaces the "early-late" processing of the temporal luminance channel.

In step (or stage) 820, a non-oriented Laplacian spatial filter 820 is applied to each of the u* and v* images. The filter has the following 3x3 kernel:

$$\begin{pmatrix} 1 & 2 & 1 \\ 2 & -12 & 2 \\ 1 & 2 & 1 \end{pmatrix} \tag{24}$$

chosen to have zero total weight and to respond with a maximum strength of 1 to any straight edge between two uniform areas with unit value difference between them. (The maximum response is attained by a horizontal or vertical edge.) This renders the u* and v* images into maps of chrominance difference, evaluated in uniform-color-space (JND) units.

In step (or stage) 830, contrast computation is performed directly on the u* and v* images resulting from step 820 as the chrominance contrast pyramids, to be interpreted analogously with the Michelson contrasts computed in the Luminance processing section. In an analogy with luminance contrasts, chrominance contrasts are computed via intra-image comparisons effected by Laplacian pyramids. Just as the Laplacian difference divided by a spatial average represents the Michelson contrast, which via Weber's law assumes a constant value at the 1-JND level (detection threshold), the Laplacian pyramid operating on u* and v* has a 1-JND interpretation. Similarly, this interpretation is modified in the course of calibration. The modification reflects the interaction of all parts of the present invention, and the fact that stimuli eliciting the 1-JND response are not simple in terms of the perceptual metric generator.

Furthermore, in step (or stage) 830, the contrast pyramid images are divided, level-by-level, by seven constants q_i (i = 0, . . . , 6), whose values are determined by calibration to be 1000, 125, 40, 12.5, 10, 10, 10, respectively. These constants are analogous to the quantities w_i (i = 0, . . . , 3) in the luminance processing section.

In step (or stage) 840, the squares of all the u* and v* contrasts are determined, but the algebraic signs are again preserved for later use. The sign preservation prevents the possibility of recording 0 JNDs between two different images because of the ambiguity of the sign loss in the squaring operation. The results are two chrominance square-contrast pyramids C_u, C_v.

In step (or stage) 850, contrast energy masking is performed. First, the denominator pyramid levels D_m (m = 0, 1, 2) are adopted directly from the Luminance processing section 220, without further alteration. However, for levels 3, . . . , 6, sequential filtering and downsampling of D_2 is performed using the same method as in the luminance processing, but without adding new terms. These D_m values are used in step 840 in the spirit of perturbation theory, in the sense that, since luminance is a more important determiner of JNDs, the effects of luminance on chrominance are presumed to be more important than the effects of chrominance on luminance. Namely, since luminance effects are expected to predominate over chrominance effects in most cases, the chrominance processing section can be viewed as a first-order perturbation on the luminance processing section. Therefore, the effects of luminance (the D_m) are modeled as masking chrominance, but not the reverse.

The masked chrominance contrast pyramid is generated by using the luminance-channel denominator pyramid D_m and the same functional form that is used for the luminance transducer to mask the chrominance square-contrast pyramids, for all pyramid levels m = 0, 1, 2:

[Equation (25) is illegible in the source; the surviving fragments show the factor s_um and a denominator containing the terms m_c D_m and c_c.]

It should be noted that the algebraic sign removed in step 830 is reattached through the factors s_um and s_vm. This operation produces masked contrast pyramids for u* and v*. Calibration has determined the values a_c = 0.15, c_c = 0.3, and β_c = [value illegible in source]. Furthermore, setting m_c to a value of 1 has produced sufficient performance in all calibrations and predictions.

FIG. 24 illustrates a detailed block diagram of an alternate embodiment of the chrominance processing section 230. Since the chrominance processing section of FIG. 24 contains many similar stages to that of the chrominance processing section of FIG. 8, a description is provided below only for the dissimilar stages.

The chrominance spatial and temporal filtering are identical for both perceptual metric generators of FIG. 8 and FIG. 24. However, chrominance contrast computation for the perceptual metric generator of FIG. 24 is achieved without the square operation. Namely, the stage 830 is replaced by stage 2400 in FIG. 24.

More specifically, in step (or stage) 830, the contrast pyramid images are divided, level-by-level, by seven constants q_i (i = 0, . . . , 6), whose values are determined by calibration to be 384, 60, 24, 6, 4, 3, 3, respectively. It should be noted that these constants are different from those of FIG. 8. These constants are analogous to the quantities w_i (i = 0, . . . , 3) in the luminance processing section.

Next, the clipped absolute values of all the u* and v* contrasts [where clip(x) = max(0, x - ε)] are computed, where ε = 0.75. Again the algebraic signs are preserved and re-attached for later use. This prevents the possibility of recording 0 JNDs between two different images because of the ambiguity of the sign loss in the absolute-value operation. The results are two chrominance contrast pyramids C_u, C_v.

Another significant difference between the perceptual metric generators of FIG. 8 and FIG. 24 is the implementation of the contrast energy masking. Unlike FIG. 8, the perceptual metric generator of FIG. 24 implements contrast energy masking 2410 in two separate stages: a cross masking stage and a self masking stage for each of the horizontal and vertical channels (See FIG. 24). Self masking reduces sensitivity in the presence of information within a current channel, whereas cross masking reduces sensitivity in the presence of information in a neighboring channel. In fact, the order of these two separate masking stages can be inverted.
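The chrominance contrast computation of stage 2400 described above (division of each pyramid level by its constant q_i, clipping of the absolute value, and re-attachment of the sign) can be illustrated with a short Python sketch. The constants 384, 60, 24, 6, 4, 3, 3 and the clip offset 0.75 come from the text; the function name `clipped_contrast` and the representation of an image as a list of rows are hypothetical choices made for this illustration only.

```python
# Per-level normalizing constants q_0..q_6 quoted in the text for FIG. 24.
Q_FIG24 = (384.0, 60.0, 24.0, 6.0, 4.0, 3.0, 3.0)
EPSILON = 0.75  # clip offset: clip(x) = max(0, x - EPSILON)


def clipped_contrast(level_img, level):
    """Normalize one pyramid level, clip its magnitude, and restore the sign."""
    out = []
    for row in level_img:
        new_row = []
        for value in row:
            contrast = value / Q_FIG24[level]
            magnitude = max(0.0, abs(contrast) - EPSILON)
            # The sign is stored before abs() and re-attached afterwards.
            new_row.append(-magnitude if contrast < 0 else magnitude)
        out.append(new_row)
    return out


print(clipped_contrast([[768.0, -768.0, 192.0]], 0))
```

Keeping the sign means that a positive and a negative contrast of equal magnitude remain distinguishable in the later difference stage, which is the stated purpose of the sign preservation.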

Use the luminance-channel denominator pyramid D_m and the same functional form that is used for the luminance transducer to mask the chrominance contrast pyramids, for all pyramid levels m = 0, . . . , 6:

$$C_{um} = \text{[right-hand side illegible in source]} \tag{26a}$$

$$\text{where } z_{um} = \frac{\text{[numerator illegible in source]}}{1 + m_c D_i},$$

and D_i is a filtered and downsampled version of D_2 when i > 2. Similarly,

$$C_{vm} = \text{[right-hand side illegible in source]} \tag{26b}$$

$$\text{where } z_{vm} = \frac{\text{[numerator illegible in source]}}{1 + m_c D_i}.$$

Note that the algebraic sign removed above has been reattached through the factors s_um and s_vm. This produces masked contrast pyramids for u* and v*. Calibration determines the values a_c = 1/2, c_c = 1/2, β_c = 1.4, and m_c = m_f = 10/1024. In general, the chrominance processing section of FIG. 24 is the preferred embodiment, whereas the chrominance processing section of FIG. 8 is an alternate embodiment.

FIG. 9 illustrates a block diagram of the chrominance metric generating section 250. Again, FIG. 9 can be perceived as a flowchart of chrominance metric generating steps or as a block diagram of the chrominance metric generating section having a plurality of hardware components for performing such chrominance metric generating steps, e.g., filters, various circuit components and/or application specific integrated circuits (ASIC). The construction of the chrominance JND map is analogous with the construction of the luminance JND map as discussed above with regard to FIG. 7. In the chrominance case, the process applies to all the masked-contrast chrominance images generated by stage 840 above: i.e., images C_u0, C_v0, . . . , C_u6, C_v6, and the corresponding images derived from the reference sequence (denoted with superscript ref in FIG. 8 and FIG. 9).

The first four steps in the following process apply to the above images separately. In the following discussion, X denotes any of these images derived from the test sequence, and X^ref the corresponding image derived from the reference sequence. Given this notation, the steps are as follows:

In step (or stage) 910, the image X is separated into two half-wave-rectified images, one for positive contrasts 912 and the other for negative contrasts 914. In the positive-contrast image (called X_+), the signs from the X contrast (separately stored as discussed above) are used to assign zeros to all pixels in X_+ that have negative contrasts. The opposite operation occurs in the negative-contrast image X_-.

In step (or stage) 920, for each image X+ and X_, a local 
pooling operation is performed by applying a 3x3 filter to 
convolve the image with a filter kernel of 0.5(1,2,1) hori- 
zontally and vertically. 

Furthermore, in step 920, the resulting images are downsampled by a factor of 2 in each direction, to remove redundancy resulting from the pooling operation. The same processing as applied to X is performed for the corresponding reference image X^ref.

In step (or stage) 930, the absolute-difference images |X_+ - X_+^ref| and |X_- - X_-^ref| are computed pixel-by-pixel. The resulting images are JND maps.

In step (or stage) 940, a coring operation is performed on the JND maps. Namely, all values less than a threshold t_c are set to zero. In the preferred embodiment, t_c defaults to a value of 0.5.

In step (or stage) 950, the Q-th power of these images is determined. In the preferred embodiment, Q is a positive integer that defaults to a value of 2.

After this process has been completed for all pairs X, X^ref, summary measures are determined by repeatedly upsampling, filtering, and adding all the images up to the required level. This is accomplished as follows:

In step (or stage) 960, upsampling and filtering are applied to the level-6 images derived from C_u6, C_u6^ref, C_v6, C_v6^ref to derive a level-5 image.

In the next step (or stage), upsampling and filtering are applied to the sum of the level-5 image from step 960 with the level-5 images derived from C_u5, C_u5^ref, C_v5, C_v5^ref. This process is continued through level 0.

Similar to the luminance processing, it should be noted 
that before the final processing step 963, the resulting image 
is half the resolution of the original image. Similarly, it 
should be noted that each pyramid-level index in this 
processing section refers to the pyramid level from which it 
was originally derived, which is twice the resolution of that 
associated with that level after filtering/downsampling. 

It should also be noted that all images generated by the 
above repeated upsampling, filtering, and adding process are 
Q-th-power-JND images. The level-0 image is used in two 
fashions, where it is sent directly to summary processing on 
path 964, or upsampled and filtered in step 963 to the 
original image resolution for display purposes. 
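The per-image part of this construction (steps 910 through 950, which mirror steps 710 through 750 of FIG. 7) can be sketched as follows. This is a minimal illustration, not the patented implementation: the pooling gain is left as a parameter because the luminance description gives the kernel as 0.25(1,2,1) while the chrominance text above gives 0.5(1,2,1); mirrored borders, the use of magnitudes in the negative-contrast image, and the helper names are assumptions.

```python
def refl(i, n):
    """Mirror an out-of-range index back into [0, n); border rule is an assumption."""
    if i < 0:
        return -i
    return 2 * n - 2 - i if i >= n else i


def half_wave(img):
    """Step 910: split into positive-contrast and negative-contrast images."""
    pos = [[max(x, 0.0) for x in row] for row in img]
    neg = [[max(-x, 0.0) for x in row] for row in img]  # stored as magnitudes
    return pos, neg


def pool_and_downsample(img, gain=0.25):
    """Step 920: separable gain*(1,2,1) pooling, then keep every other pixel."""
    h, w = len(img), len(img[0])
    rows = [[gain * (row[refl(c - 1, w)] + 2 * row[c] + row[refl(c + 1, w)])
             for c in range(w)] for row in img]
    full = [[gain * (rows[refl(r - 1, h)][c] + 2 * rows[r][c] + rows[refl(r + 1, h)][c])
             for c in range(w)] for r in range(h)]
    return [row[::2] for row in full[::2]]


def jnd_q_maps(x, x_ref, t_c=0.5, q=2, gain=0.25):
    """Steps 910-950 for one test/reference pair; returns [positive map, negative map]."""
    maps = []
    for part, part_ref in zip(half_wave(x), half_wave(x_ref)):
        a = pool_and_downsample(part, gain)
        b = pool_and_downsample(part_ref, gain)
        # Step 930: pixel-by-pixel absolute difference.
        diff = [[abs(p - r) for p, r in zip(ra, rb)] for ra, rb in zip(a, b)]
        # Step 940: coring, values below t_c are set to zero.
        cored = [[d if d >= t_c else 0.0 for d in row] for row in diff]
        # Step 950: raise to the Q-th power.
        maps.append([[d ** q for d in row] for row in cored])
    return maps


test_img = [[2.0] * 4 for _ in range(4)]
ref_img = [[0.0] * 4 for _ in range(4)]
print(jnd_q_maps(test_img, ref_img))
```

The resulting maps are at half the resolution of their input level, which is the redundancy removal that the later upsampling and adding stages rely on.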

As previously discussed, the luminance and chrominance JND maps passed to the output summary step are Q-th-power-JND images, and are represented at half the resolution of the original image. This exploits the redundancy inherent in having performed pooling at each masked-contrast stage. Each of these half-resolution images can be reduced to a single JND performance measure by averaging all the pixels through a Minkowski addition:

$$JND_{luminance} = \left[\frac{1}{N_p}\sum_{i,j} L_{JND}^{Q}(i,j)\right]^{1/Q} \tag{27}$$

$$JND_{chrominance} = \left[\frac{1}{N_p}\sum_{i,j} C_{JND}^{Q}(i,j)\right]^{1/Q} \tag{28}$$

where N_p is the number of pixels in each JND map, JND_luminance and JND_chrominance are the summary measures, and L_JND^Q and C_JND^Q are the half-resolution maps from luminance and chrominance map construction, respectively. In each case, the sum is over all the pixels in the image. As stated previously, the value of the Minkowski exponent Q defaults to 2.

From the luminance and chrominance summary measures, a single performance measure for a field is computed by Minkowski addition, i.e.,

$$JND_{field} = \left[(JND_{luminance})^{Q} + (JND_{chrominance})^{Q}\right]^{1/Q} \tag{29}$$

where Q again defaults to 2.




A single performance measure, JND (subscript illegible in source), for N fields of a video sequence is computed by adding the JND values for each field, again in the sense of Minkowski. Q defaults to 2.

$$JND = \text{[right-hand side illegible in source]} \tag{30}$$
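The summary measures can be illustrated with a small Python sketch. It assumes that the per-map measure is the Q-th root of the pixel average of the Q-th-power map, and that the field measure is the Minkowski sum of the luminance and chrominance values, as the surrounding text describes. Whether the sequence measure is normalized by the number of fields cannot be recovered from the text, so the sketch uses a plain Minkowski sum and flags that as an assumption. The function names are hypothetical.

```python
def map_summary(jnd_q_map, q=2):
    """Average the Q-th-power pixels of one map, then take the Q-th root."""
    pixels = [p for row in jnd_q_map for p in row]
    return (sum(pixels) / len(pixels)) ** (1.0 / q)


def minkowski_add(values, q=2):
    """Minkowski addition: Q-th root of the sum of Q-th powers."""
    return sum(v ** q for v in values) ** (1.0 / q)


def field_measure(lum_map, chrom_map, q=2):
    """Combine the luminance and chrominance summaries of one field."""
    return minkowski_add([map_summary(lum_map, q), map_summary(chrom_map, q)], q)


def sequence_measure(field_values, q=2):
    """Assumption: unnormalized Minkowski sum over the per-field measures."""
    return minkowski_add(field_values, q)


lum = [[9.0, 9.0], [9.0, 9.0]]        # Q-th-power map, i.e. 3 JNDs everywhere
chrom = [[16.0, 16.0], [16.0, 16.0]]  # Q-th-power map, i.e. 4 JNDs everywhere
print(field_measure(lum, chrom))
```

Because the maps handed to the summary step are already Q-th-power images, `map_summary` does not raise the pixels to the power Q again.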



FIG. 25 illustrates a detailed block diagram of an alternate embodiment of the chrominance metric generating section 250. Since the chrominance metric generating of FIG. 25 contains many similar stages to that of the chrominance metric generating of FIG. 9, a description is provided below only for the dissimilar stages.

More specifically, the "coring" stage 940 and "raise to a Q-th power" stage 950 are replaced by a plurality of max and sum stages which maintain a running sum and a running maximum of the channel outputs. Since the process illustrated by FIG. 25 is the same as FIG. 9 up to stage 930, the process of FIG. 25 is now described starting from the point where the absolute-difference images |X_+ - X_+^ref| and |X_- - X_-^ref| have been determined.

Next, after the process has been completed for all pairs of X, X^ref, a running-sum image is initialized in stage 2540 to contain the sum of the level-6 images derived from C_u6, C_u6^ref, C_v6, and C_v6^ref. Similarly, a running-maximum image is initialized in stage 2542 as the point-by-point maximum of these same images.

Next, the running-sum and running-maximum images are upsampled and filtered by stages 2540a and 2542a, respectively, to comprise two level-5 images. The running-sum image is then updated by stage 2544 by adding to it the level-5 images derived from C_u5, C_u5^ref, C_v5, and C_v5^ref. Similarly, the running-maximum image is updated by stage 2546 by comparing it with the level-5 images derived from C_u5, C_u5^ref, C_v5, and C_v5^ref. This process is repeated down to pyramid level 0.

Finally, having performed the above steps, a point-by-point linear combination of the running-sum and running-max images is performed to produce the chrominance JND map:

$$JND_C(i,j) = k_c\,\mathrm{Running\_Max}(i,j) + (1 - k_c)\,\mathrm{Running\_Sum}(i,j), \tag{30a}$$

where k_c = 0.836. The value for k_c is determined by approximating a Minkowski Q-norm. Given a value of Q and a number of images N to be brought together, the value k_c = [N - N^(1/Q)]/[N - 1] ensures that the approximate measure matches the Q-norm exactly when all the compared entries (at a pixel) are the same, and also when there is only one nonzero entry. In this case, N = 28 (number of channels), and Q = 2.
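The weighting constant and the two exactness conditions stated above can be checked numerically. The Python sketch below evaluates k = [N - N^(1/Q)]/[N - 1] and the max/sum combination of Equation (30a); the function names are hypothetical. For the luminance case the text states that 0.783 is the average of the full-height (N = 14) and half-height (N = 12) constants. The chrominance value 0.836 is consistent with the same kind of averaging if the half-height chrominance case has N = 24, but that channel count is an inference from the elimination of the two highest-resolution channels and is not stated in the text.

```python
def k_weight(n, q=2):
    """Weight k = [N - N**(1/Q)] / [N - 1] for N images and exponent Q."""
    return (n - n ** (1.0 / q)) / (n - 1)


def approx_qnorm(values, k):
    """Max/sum approximation: k * max + (1 - k) * sum."""
    return k * max(values) + (1.0 - k) * sum(values)


def qnorm(values, q=2):
    """Exact Minkowski Q-norm, for comparison."""
    return sum(v ** q for v in values) ** (1.0 / q)


k14 = k_weight(14)
equal_entries = [3.0] * 14
single_entry = [5.0] + [0.0] * 13
print(approx_qnorm(equal_entries, k14), qnorm(equal_entries))
print(approx_qnorm(single_entry, k14), qnorm(single_entry))

# Averages of the full-height and (assumed) half-height constants.
print(round((k_weight(14) + k_weight(12)) / 2, 3))
print(round((k_weight(28) + k_weight(24)) / 2, 3))
```

For inputs between the two extreme cases the combination is only an approximation of the Q-norm, which is the price paid for avoiding the power and root operations.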

As in luminance processing, after these operations the resulting image is half the resolution of the original. It should be noted that each pyramid-level index in this process refers to the pyramid level from which it was originally derived, which is twice the resolution of that associated with that level after filtering/downsampling.

It should also be noted that all images generated by the repeated upsampling/filtering and adding/maxing process above can be added with weights k_c and 1 - k_c to produce JND images. The level-0 image is used in two fashions, where it is sent directly to summary processing or upsampled to the original image resolution and filtered for display purposes.

In general, the chrominance metric generating section of FIG. 25 is the preferred embodiment, whereas the chrominance metric generating section of FIG. 9 is an alternate embodiment. One reason is that the max-sum method is computationally less expensive. Thus, if dynamic range in an integer implementation is desired, then the chrominance metric generating section of FIG. 25 is preferred. Otherwise, if a floating point processor is employed, then the chrominance metric generating section of FIG. 9 can also be used.
Half-Height Chrominance Processing 

If the half-height images are to be passed through directly 
without zero-filling to the true image height, then the above 
chrominance processing section 230 must be modified to 
reflect that the inherent vertical resolution is only half the 
inherent horizontal resolution. FIG. 26 and FIG. 27 are block 
diagrams of chrominance processing section and chromi- 
nance metric generating section for processing half-height 
images. 

Comparison between these diagrams (FIG. 26 and FIG. 27) and the
corresponding diagrams for full-height interlace (FIG. 24 and FIG.
25) reveals that many stages are identical. As such, the description
below for FIG. 26 and FIG. 27 is limited to the differences between
the two implementations.

First, the highest-resolution chrominance channels, u_0* and v_0*,
are eliminated. Since chrominance sensitivity is generally low at
high spatial frequencies, the loss of these channels is not
significant.

Second, to produce the next-highest resolution chrominance images
u_1* and v_1*, a lowpass "Kell" filter 2600 with a kernel of weights
(1/8, 3/4, 1/8) is applied vertically (i.e., along columns). This
operation corresponds to the joint filtering of the assumed
de-interlace filter, together with the filtering performed by the
vertical components of the 3x3 filters of the full-height
embodiment. The resulting vertically filtered images are then
horizontally filtered with a 1x3 filter 2610 with a kernel of
weights 0.25 (1, 2, 1). This filtering of u* and v* images makes the
half-height images isotropic in resolution. The resolution is that
of full-height pyramid-level 1.
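
The two filtering steps above can be sketched in a few lines of Python.
This is an illustrative sketch only: the function names are
hypothetical, images are plain lists of rows, and edge pixels are
handled by replicating the nearest row or column, which is an
assumption made here for brevity rather than the border treatment of
the described apparatus.

```python
KELL = (0.125, 0.75, 0.125)   # vertical kernel (1/8, 3/4, 1/8)
HORIZ = (0.25, 0.5, 0.25)     # horizontal kernel 0.25 * (1, 2, 1)
OFFSETS = (-1, 0, 1)


def _clamp(i, n):
    """Replicate the edge sample (an assumption of this sketch)."""
    return min(max(i, 0), n - 1)


def filter_vertical(img, kernel):
    """Apply a 3-tap kernel along columns."""
    rows, cols = len(img), len(img[0])
    return [[sum(k * img[_clamp(r + d, rows)][c] for k, d in zip(kernel, OFFSETS))
             for c in range(cols)] for r in range(rows)]


def filter_horizontal(img, kernel):
    """Apply a 3-tap kernel along rows."""
    cols = len(img[0])
    return [[sum(k * row[_clamp(c + d, cols)] for k, d in zip(kernel, OFFSETS))
             for c in range(cols)] for row in img]


def half_height_smooth(img):
    """Vertical Kell filter followed by the horizontal 1x3 filter."""
    return filter_horizontal(filter_vertical(img, KELL), HORIZ)


if __name__ == "__main__":
    impulse = [[0.0, 0.0, 0.0], [0.0, 16.0, 0.0], [0.0, 0.0, 0.0]]
    for row in half_height_smooth(impulse):
        print(row)
```

Both kernels sum to one, so a spatially constant image passes through
unchanged.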

FIG. 27 illustrates a chrominance metric generating sec- 
tion for processing half-height images. First, the 0-level is 
not present. As such, various parameters and the "pathway" 
of the running-maximum and running-sum are modified. For 
example, the value of N that determines k is changed to 24 
from 28. The same value, k=0.836, is used for both full- and 
half-height processing and is the average of full- and half- 
height constants computed from the formula given above. 
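
As a quick arithmetic check of the averaging statement above, the
formula k = [N - N^{1/Q}]/[N - 1] can be evaluated for both channel
counts with Q = 2. The helper name minkowski_weight is hypothetical.

```python
def minkowski_weight(n, q):
    """Weight k = [N - N^(1/Q)] / [N - 1]."""
    return (n - n ** (1.0 / q)) / (n - 1)


k_full = minkowski_weight(28, 2)     # full-height processing, N = 28
k_half = minkowski_weight(24, 2)     # half-height processing, N = 24
k_shared = 0.5 * (k_full + k_half)   # average used for both modes

if __name__ == "__main__":
    print(f"{k_full:.3f} {k_half:.3f} {k_shared:.3f}")
```

The two constants come out near 0.841 and 0.830, and their mean rounds
to the shared value 0.836.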

Since the maximum and sum streams are fully accumulated at pyramid
level 1 in the chrominance embodiment, the chrominance JND map for
the summary measures is only half the size (both horizontally and
vertically) of the fully accumulated luminance map. Thus, prior to
combining the chrominance and luminance maps to produce the
total-JND map, the chrominance map must first be brought to the same
resolution as the luminance map. To achieve this goal, an upsample
followed by a 3x3 filter 2705 is performed to produce the
chrominance JND map for summary measures.

As in the full-height embodiment, the chrominance map for summary
measures must be brought to full image resolution before it is
displayed. For consistency with the analogous operation in the
luminance map, the chrominance map is brought to full resolution in
the horizontal direction by upsampling followed by 1x3 filtering
(kernel 0.5[1, 2, 1]) in stage 2710. In the vertical direction,
line-doubling is performed in stage 2720.
JND Output Summaries 

As discussed above, the luminance and chrominance JND 
maps passed to the output summary step are JND images, 
and are represented at half the resolution of the original 
image. This exploits the redundancy inherent in having performed
pooling at each masked-contrast stage.

Next, the luminance and chrominance JND maps JND_L and JND_C are
combined into a total-field JND map, JND_T. The combination is
accomplished with an approximate Minkowski addition, in analogy with
the combination of channels to produce the maps JND_L and JND_C:

JND_T(i,j) = k_T max[JND_L(i,j), JND_C(i,j)] + (1 - k_T)[JND_L(i,j) + JND_C(i,j)]    (30b)

where k_T = 0.586. The selection for k_T is determined by
approximating a Minkowski Q-norm. In this case, there are two (2)
terms in the max/sum and Q = 2.

In turn, each of the half-resolution JND images (three for each
field: luma, chrominance, and total-field) is reduced to a single
JND performance measure called a JAM by the following histogram
process:

First, a histogram of JND values (with bin-size 1/8 JND) is created,
but values less than a threshold level t_c are not included. All
values greater than 100 JNDs are recorded as 100 JNDs.

Second, the JAM is adopted as the 90th percentile of the JND scores
from the above abbreviated histogram. In this fashion, three values
JAM_luma, JAM_chroma, and JAM_total are computed for the summary
measures corresponding respectively to JND_L, JND_C and JND_T. This
is accomplished for each field in a video sequence.

From N single-field JAM_field values in a video sequence, a single
performance measure JAM_N is computed in one of two fashions,
depending on the length of the sequence.

For N > 10:
JAM_N equals the 90th percentile of the histogram of JAM_field
values.

For N ≤ 10:
JAM_N is determined by the following process that provides a degree
of continuity as N increases. First, a histogram of JAM_field values
is created. Second, this histogram is approximated by a "faux
histogram" that has the same minimum, maximum, and mean as the true
histogram, but consists of a constant with a single-bin peak at
either the minimum or maximum value. Third, the N-field JAM is
adopted as the 90th percentile of the JAM_field scores from the
above faux histogram.

It should be noted that subjective rating data are noisy and
unreliable for short video sequences (e.g., less than 1/2 second, or
15 frames). Thus, JAM estimates may correlate poorly with subjective
ratings for short sequences.
Image Border Processing

In the present perceptual metric generator, it has been observed
that border-reflection at each stage can propagate artifacts into
the luminance and chrominance JND maps, thereby necessitating
cropping to keep the JND maps from being contaminated by these
artifacts. To address this problem, a method was developed that
replaces the screen border by a gray bezel of infinite extent, but
operates without enlarging the real image size by more than six
pixels on a side. Use of this "virtual-bezel" eliminates the need to
crop the JND map to avoid border artifacts. The infinite gray bezel
models viewing conditions and hence can be considered
non-artifactual. With this interpretation, the entire JND map is
uncontaminated by artifacts, and can be exhibited by a Picture
Quality Analyzer.

In the following description, an image that has been padded with 6
pixels on all sides is referred to as a "padded image", and an
unpadded image or its locus within a padded image is referred to as
the "image proper".

Since image operations are local, the virtually infinite bezel can
be implemented efficiently. Sufficiently far outside the image
proper, an infinite bezel results in a set of identical, constant
values at any given stage. The effect of image operations, e.g.,
filtering, performed in this constant region can be computed a
priori. Thus, a narrow border (6 pixels in the current
implementation) can provide the proper transition from the image
proper to the infinite bezel.

At the input, the bezel is given the values Y' = 90, U' = V' = 0.
(The value of Y' = 90 corresponds to half the background value of
15% of the maximum screen luminance.) However, the bezel is not
needed until after [...] processing, since spatial interactions that
extend beyond [...] do not occur until [...]. In the luminance
channel, no borders (and hence no bezel values) are appended to
images until after luminance compression. In the chrominance
channel, borders are appended after front end processing.

In the luminance channel, the first bezel value after luma
compression is

first_luma_bezel = [...]    (30c)

In the u* and v* channels, the first bezel values are both 0.

These values are propagated through subsequent stages of the model
in three ways:

1) Pixel-by-pixel functions operate on old bezel values to produce
new bezel values. For example, the bezel value resulting from the
1.4 power function is:

bezel_out = (bezel_in)^{1.4}    (30d)

2) 3x3 spatial filters whose rows and columns sum to P set the
output bezel value to the input bezel times P.

3) Contrast function numerators and four-field time filters (which
have tap sums of zero) set the output bezel value to 0.

At the contrast stage and subsequently, the bezel is given the value
0 in luminance and chrominance channels, i.e., the logical
consequence of operating with a zero-sum linear kernel on a
spatially constant array.

The present method for generating the virtual bezel is disclosed in
U.S. patent application Ser. No. 08/997,267 filed on Dec. 23, 1997
and entitled "Method for Generating Image Pyramid Borders". This
U.S. patent application Ser. No. 08/997,267 is hereby incorporated
by reference.
Integrating Image and Bezel

Starting with the pyramid stages of the model, borders need to be
supplied. The first border operation on an N-by-M input image is to
pad the image with 6 pixels (on all sides) with the appropriate
bezel value (first_luma_bezel for the compressed luma image, and 0
for u* and v* images). The padded image has dimensions
(N+12)x(M+12). For the k-th pyramid level (where k can range from 0
to 7) the padded image has dimensions ([N/2^k]+12)x([M/2^k]+12),
where "[x]" denotes the greatest integer in x.

Images at all pyramid levels are registered to each other at the
upper left hand corner of the image proper. Indices of the image
proper run from 0 ≤ y < height, 0 ≤ x < width. The upper left hand
corner of the image proper always has indices (0,0). Indices of
bezel pixels take on height and width values less than 0. For
example, the upper left hand bezel pixel is (-6,-6). If we look
along the x-dimension starting at the left hand edge for an image of
width w (image plus bezel width w+12), the bezel pixels are indexed
by x = (-6, -5, ..., -1), the real image is indexed (0, 1, ...,
w-1), and the right hand bezel indices span (w, w+1, ..., w+5).

Given a padded image, there are four things that can happen
depending on the subsequent stage of processing. In describing these
operations below, we use single image lines to summarize spatial
processing (with the understanding that the analogous events take
place in the vertical direction).

(a) For pixel-by-pixel operations. When the next operation is to
operate pixel-by-pixel (e.g., with a nonlinearity), the padded image
is simply passed through the operation, and the output-image
dimensions are the same as the input-image dimensions. The same
occurs when the operation is between corresponding pixels in
different fields or different color-bands.

(b) For 3x3 spatial filters. Suppose (in one dimension) the unpadded
input image has dimension N_k. Then the padded input image has
dimension N_k+12, and the padded output image has dimension N_k+12
as well. The output bezel value is first computed and written into
at least those bezel pixels not otherwise filled by the subsequent
image operation. Then, starting 1 pixel away from the left edge of
the padded input image, the 3x3 kernel starts operating on the input
image and over-writing the bezel values of the output image,
stopping 1 pixel away from the right (or bottom) edge of the image
(where the original bezel value survives). The pre-written bezel
value makes it unnecessary for the kernel operation ever to go
outside the original (padded) image to compute these values.

(c) For filtering and down-sampling in REDUCE. Given an input padded
image with dimension N_k+12, an output array is allocated with
dimension [N_k/2]+12. The bezel value is written into at least those
bezel pixels not otherwise filled by the subsequent filter and
downsample operation. Then, the input image is filtered according to
(b) above, but the filter is applied at pixels -4, -2, 0, 2, 4, ...,
until the input image is exhausted, and the output values are
written into consecutive pixels -2, -1, 0, 1, 2, ..., until there is
no further place for them in the output image. Note that the
position of pixel 0 in the new image is 7 pixels from the left end
of the new image. The last-pixel application of the filter takes
input pixel N_k+3 to output pixel [N_k/2]+2 if N_k is odd, and it
takes input pixel N_k+4 to output pixel [N_k/2]+2 if N_k is even.
(Here, we refer to the filter's input pixel as the pixel
corresponding to the center of the 3-pixel kernel.)
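
Cases (b) and (c) above can be sketched on a single image line as
follows. This is a minimal illustration under stated assumptions, not
the described implementation: the function names are hypothetical, the
3-tap kernel (0.25, 0.5, 0.25) is assumed here because the text does
not give the REDUCE kernel, and the output bezel is taken as the input
bezel times the kernel tap sum.

```python
PAD = 6                      # bezel pixels on each side, as in the text
OFFSETS = (-1, 0, 1)


def pad_line(line, bezel):
    """Surround a 1-D image line with PAD bezel pixels on each side."""
    return [bezel] * PAD + list(line) + [bezel] * PAD


def filter3(padded, kernel, bezel_in):
    """Case (b): 3-tap filter; the output keeps the padded length."""
    bezel_out = bezel_in * sum(kernel)     # constant region scales by the tap sum
    out = [bezel_out] * len(padded)        # pre-write the bezel everywhere
    for c in range(1, len(padded) - 1):    # stay 1 pixel inside both edges
        out[c] = sum(k * padded[c + d] for k, d in zip(kernel, OFFSETS))
    return out, bezel_out


def reduce_line(padded, kernel, bezel_in):
    """Case (c): filter at even image indices and keep every other sample."""
    n = len(padded) - 2 * PAD              # width of the image proper
    bezel_out = bezel_in * sum(kernel)
    out = [bezel_out] * (n // 2 + 2 * PAD) # [n/2] + 12 output pixels
    j = -2                                 # first output index (input pixel -4)
    while 2 * j + 1 <= n + PAD - 1:        # kernel must fit in the padded input
        c = 2 * j + PAD                    # array position of input pixel 2j
        out[j + PAD] = sum(k * padded[c + d] for k, d in zip(kernel, OFFSETS))
        j += 1
    return out, bezel_out, j - 1           # j - 1 is the last output index written


if __name__ == "__main__":
    kernel = (0.25, 0.5, 0.25)
    for width in (5, 6):
        padded = pad_line([8.0] * width, 8.0)
        out, bezel, last = reduce_line(padded, kernel, 8.0)
        print(width, len(out), last)
```

For widths 5 and 6 the last output index written is 4 and 5
respectively, which agrees with the [N_k/2]+2 rule in (c).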
Luminance Calibration and Prediction 

Psychophysical data were used for two purposes: 1) to calibrate the
luminance processing section (i.e., to determine values for certain
processing parameters), and 2) to confirm the predictive value of
the luminance processing section once it was calibrated. In all
cases, the stimuli were injected into the perceptual metric
generator as Y-value images immediately prior to the luminance
processing.
Calibration

The luminance processing section 220 can be calibrated iteratively,
using two sets of data. One data set is used to adjust the
pre-masking constants (w_i, t_e and t_l) in steps 640, 642 and 650
of the luminance processing section. The other set of data is used
to adjust the masking-stage constants σ, β, a and c in step 660 of
the luminance processing section. Since the JND values are always
evaluated after step 660, the adjustment of the constants in step
660 with the second data set necessitated readjustment of the steps
640, 642 and 650 constants with the first data set. The readjustment
of these constants was continued until no further change was
observed from one iteration to the next. It should be noted that,
although the above iterative process starts out by interpreting a
unit value of unmasked contrast (steps 640, 642 and 650) as one JND
of visual output, the process of masking perturbs this
interpretation. The details of the adjustments are described in the
subsections below.
Adjustment of Contrast-normalization Constants (steps 640, 642 and
650)

The perceptual metric generator predictions for spatial and temporal
contrast sensitivities prior to masking were matched to
contrast-sensitivity data for sine waves presented by Koenderink and
Van Doorn (1979). To generate points on the perceptual metric
generator-based curve, a low-amplitude sine wave was presented as a
test image to the perceptual metric generator (either in space or in
time), and the contrast threshold for 1 JND output was assessed. In
each case the reference image implicitly had a uniform field with
the same average luminance as the test field.
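
The text does not state how the contrast threshold for 1 JND output
was located. One plausible way, shown here purely as an assumption, is
a bisection search over contrast. In the sketch below, toy_model is a
hypothetical stand-in for a full run of the perceptual metric
generator, and the search assumes the JND output does not decrease as
contrast increases.

```python
def contrast_threshold(model, target_jnd=1.0, lo=0.0, hi=1.0, tol=1e-6):
    """Smallest contrast (within tol) at which model(contrast) >= target_jnd.

    Assumes model is non-decreasing in contrast on [lo, hi].
    """
    if model(hi) < target_jnd:
        raise ValueError("target JND is not reached within the search range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if model(mid) >= target_jnd:
            hi = mid          # threshold is at or below mid
        else:
            lo = mid          # threshold is above mid
    return hi


def toy_model(contrast, sensitivity=200.0):
    """Hypothetical stand-in: JND output proportional to contrast."""
    return sensitivity * contrast


if __name__ == "__main__":
    threshold = contrast_threshold(toy_model)
    print(threshold, 1.0 / threshold)   # threshold and implied sensitivity
```

Repeating such a search at each spatial or temporal frequency would
yield one point per frequency on the model-based sensitivity curve.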

The fit of spatial contrast sensitivity to data (see FIG. 10 for
final fit) was used to adjust the contrast-pyramid sensitivity
parameters w_0, w_1, and w_2 in steps 640, 642 and 650 of the
perceptual metric generator. The dashed lines in FIG. 10 represent
the sensitivities of the separate pyramid channels that comprise the
total sensitivity (solid line). It should be noted that the spatial
model fit in FIG. 10 was not extended beyond 15 cycles/deg,
consistent with the viewing-distance constraint discussed above: a
viewing distance of four screen-heights. Similar adjustment of w_0,
w_1, and w_2 can be performed to accommodate slightly different
viewing distances; much greater viewing distances might require
lower-resolution pyramid levels, and these could be easily
incorporated at low computational expense.

The fit of temporal contrast-sensitivity to data (see FIG. 11 for
final fits) was used to adjust the temporal filter-tap parameters
t_e and t_l as well as the contrast-pyramid sensitivity parameter
w_3. The method used to fit these parameters is analogous to the
spatial-contrast calibration. The lowest-spatial-frequency data of
Van Doorn and Koenderink at various temporal frequencies were
matched against the sensitivities computed for spatially uniform
temporal sine waves. In each case, the vision-model field rate
sampled the temporal sine wave at 50 and 60 Hz, and this gave rise
to the distinct parameter values noted above.
Adjustment of Masking Constants (step 660)

The masking-parameter values σ, β, a and c (in step 660 of the
perceptual metric generator) were fit by comparing predictions for
masked contrast discrimination with data acquired by Carlson and
Cohen (1978). The results of the final-fit comparison appear in FIG.
12. From the Carlson-Cohen study, a single observer's data was
chosen subject to the criteria of being representative and also of
having sufficient data points. In this case, the perceptual metric
generator stimulus consisted of a spatial sine wave of given
pedestal contrast in both test and reference fields, and
additionally a contrast increment of the test-field sine wave. The
contrast-increment necessary to achieve 1 JND was determined from
the perceptual metric generator for each contrast-pedestal value,
and then plotted in FIG. 12.
Predictions

After perceptual metric generator calibration, perceptual metric
generator predictions were compared with detection and
discrimination data from stimuli that were not sine waves. This was
done in order to check the transferability of the sine-wave results
to more general stimuli. It will be seen from FIGS. 13, 14, and 15
that the predictions were not applied to patterns with nominal
spatial frequencies above 10 cycles/deg. Such patterns would have
had appreciable energies at spatial frequencies above 15 cycles/deg,
and would have aliased with the pixel sampling rate (30 samples per
degree; see discussion above).

In the first study (FIG. 13), low-contrast disks in the test field
were detected against a uniform reference field. The experimental
data are from Blackwell and Blackwell (1971).
In running the perceptual metric generator for this particular
study, it was necessary to replace the spatial Q-norm summary
measure with a maximum. Otherwise the JND result was sensitive to
the size of the background of the disk (i.e., to image size).

In the second study (FIG. 14), on the detection of a low-amplitude
checkerboard, the data were acquired in an unpublished study at
Sarnoff.

The third study (data from Carlson and Cohen, 1980) was somewhat
different from the first two. A blurred edge given by erf(ax) was
presented in the reference image, and discrimination was attempted
against an edge given by erf(a'x) in the test image. Here, x is
retinal distance in visual degrees, a = πf/[ln(2)]^{0.5},
a' = π(f+Δf)/[ln(2)]^{0.5}, and f is in cycles/deg. Here, Δf is the
change in f required for one JND. The plot in FIG. 15 is Δf/f versus
f.

It can be seen that the perceptual metric generator predictions are
well fitted to the data, for the range of spatial frequencies
characteristic of the display at the four-screen-height viewing
distance.
Chrominance Calibration

As in luminance-parameter calibration, psychophysical data were used
to calibrate chrominance parameters (i.e., to adjust their values
for best model fits). In all cases, the stimuli were four equal
fields, injected into the perceptual metric generator as images in
CIE X, Y, and Z just prior to conversion to CIELUV.
Adjustment of Contrast-normalization Constants (step 830)

The perceptual metric generator predictions for chromatic contrast
sensitivities prior to masking were matched to contrast-sensitivity
data presented by Mullen (1985). The test sequences used were four
equal fields, each with a horizontally varying spatial sine-wave
grating injected as (X, Y, Z) values. The data used for calibration
were from Mullen's FIG. 6, corresponding to which each test image
was a red-green isoluminous sine-wave. At pixel i, the test-image
sine wave had tristimulus values given by

X(i) = (Y_0/2) \{ (x_r/y_r + x_g/y_g) + \cos(2\pi f a i) \, \Delta m \, (x_r/y_r - x_g/y_g) \}
Y(i) = Y_0
Z(i) = (Y_0/2) \{ (z_r/y_r + z_g/y_g) + \cos(2\pi f a i) \, \Delta m \, (z_r/y_r - z_g/y_g) \}

Here Δm is the threshold incremental discrimination contrast,
(x_r, y_r) = (0.636, 0.364) is the chromaticity of the red
interference filter (at 602 nm), (x_g, y_g) = (0.122, 0.823) is the
chromaticity of the green interference filter (at 526 nm),
z_r = 1 - x_r - y_r, z_g = 1 - x_g - y_g, and a = 0.03 deg/pixel.
The reference-image is a uniform field represented by Equation (28)
but with Δm = 0. For purposes of the perceptual metric generator, it
is sufficient to set Y_0 = 1.

To generate points on the model-based curve, the above stimulus was
presented at various values of f, and the contrast threshold Δm for
1 JND output was assessed. The fit of modeled chromatic-contrast
sensitivity to data (see FIG. 16 for final fit) was used to adjust
the parameters q_i (i = 0, ..., 6) in the perceptual metric
generator.
Adjustment of Masking Constants (step 840)

The perceptual metric generator predictions for chrominance masking
were matched to data presented by Switkes, et al. (1988). The test
sequences used were four equal fields, each with a horizontally
varying spatial sine-wave grating injected as (X, Y, Z) values. To
correspond with FIG. 4 of that work (chrominance masking of
chrominance), at pixel i, the test-image sine wave had tristimulus
values given by

X(i) = (Y_0/2) \{ (x_r/y_r + x_g/y_g) + \cos(2\pi f a i) [ (m + \Delta m)(x_r/y_r - x_g/y_g) ] \}
Y(i) = Y_0    (32)
Z(i) = (Y_0/2) \{ (z_r/y_r + z_g/y_g) + \cos(2\pi f a i) [ (m + \Delta m)(z_r/y_r - z_g/y_g) ] \}

where Δm is the threshold incremental discrimination contrast,
(x_r, y_r) = (...) is the chromaticity of the red phosphor,
(x_g, y_g) = (0.301, 0.589) is the chromaticity of the green
phosphor, z_r = 1 - x_r - y_r, z_g = 1 - x_g - y_g, and
f a = 2 c/deg * 0.03 deg/pixel = 0.06. The reference-image sine wave
is the same as the test-image sine wave but with Δm = 0. For
purposes of the perceptual metric generator, it is sufficient to set
Y_0 = 1.

To generate points on the model-based curve, the above stimulus was
presented at various values of mask contrast m, and the contrast
threshold Δm for 1 JND output was assessed. The fit of modeled
chromatic-contrast sensitivity to data (see FIG. 17 for final fit)
was used to adjust the parameters σ_c, β_c, a_c, c_c, and k in the
perceptual metric generator.
Comparisons with Rating Data

Four image sequences, each with various degrees of distortion, were
used to compare the present perceptual metric generator with DSCQS
rating data. The results are plotted in FIG. 18, and reveal a
correlation of 0.9474 between the perceptual metric generator and
the data. For each of the sequences, the perceptual metric generator
processed 30 fields (as opposed to the four fields used to test
previous releases).

Several data points that were present in the previous releases were
removed from the plot. These points were deleted for the following
reasons:

(1) Five points were deleted that corresponded to warm-up tests on
all the subjects. The Rec 500 suggests that the first five tests in
a sequence should be deleted because they represent a stabilization
of the subject's judgment.

(2) For one of the "Gwen" sequences, there are small shifts of the
test sequence with respect to the reference sequence occurring
between the images of the trees in the background, even when the
foreground is exactly aligned between test and reference. The
blue-screen video was introduced separately [...].
JND Map Interpretation

The JND Maps are in a form suitable for subsequent processing to
determine JNDs within any spatial or temporal window. As noted
above, the values in the maps are in units of JNDs raised to the Qth
power, rather than in simple JND units. To obtain a single JND value
for any spatio-temporal region of the video stream, it is only
necessary to sum up the values from the JND Map within that region,
and then take the Qth root.

A couple of examples will clarify this processing. To retrieve 1 JND
value for each pixel (probably the most typical desired output),
take the Qth root of each pixel in the JND Map.

However, for typical MPEG-2 encoder analysis applications, it may be
useful to have a single value for each 16x16 pixel macroblock rather
than for each pixel. To obtain 1 JND per macroblock, sum all the JND
Map outputs within each macroblock, and then take the Qth root. The
result will be a macroblock-resolution map of JND values.
Pyramid Construction: Image Size & Border Requirements

The current implementation of the pyramid method will not encounter
image-dimension problems if the greater image dimension N and the
lesser image dimension M satisfy the following conditions.
1) M must be at least 128.

2) M must be divisible by 2 as many times (P) as it takes to
retrieve a quotient less than 64.

3) N must also be P times divisible by 2.

The perceptual metric generator identifies as illegal any images
that do not satisfy these conditions. As an example of how these
rules work, consider image dimensions N=720, M=480. Condition 1) is
satisfied because M>128. Condition 2) is met because M can be
divided three times by 2, and encounters the less-than-64 criterion
at division 3 (hence P=3). Finally, condition 3) is satisfied
because N can also be divided by 2 three times to yield an integer.
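
The three conditions above translate directly into a small check. The
following Python sketch is illustrative only; the function name
pyramid_reductions is hypothetical, and condition 2) is read here as
requiring every halving of M to be exact until the quotient falls
below 64.

```python
def pyramid_reductions(n, m):
    """Return P if the image dimensions satisfy conditions 1) to 3), else None."""
    n, m = max(n, m), min(n, m)    # N is the greater dimension, M the lesser
    if m < 128:                    # condition 1
        return None
    p, quotient = 0, m
    while quotient >= 64:          # condition 2: halve until quotient < 64
        if quotient % 2:
            return None            # M is not divisible by 2 often enough
        quotient //= 2
        p += 1
    if n % (2 ** p):               # condition 3: N divisible by 2, P times
        return None
    return p


if __name__ == "__main__":
    print(pyramid_reductions(720, 480))   # the example from the text
    print(pyramid_reductions(722, 480))   # illegal: 722 is not divisible by 8
```

For the 720x480 example the function returns P = 3, matching the
worked example in the text.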
Interlace Considerations 

The purpose of the following discussion is to clarify the handling
of field interlace (and, specifically, inter-line spaces) in the
present perceptual metric generator. Inter-line spaces are not
visible to humans viewing displays, but do produce pronounced
effects in the perceptual metric generator if they are modeled by
black values. As a result of visibility of the lines by the
perceptual metric generator, vertical image distortions at any
spatial frequencies are masked by the high-frequency line structure.
Furthermore, the visibility of the line structure would be a primary
cause of JND artifacts when an interlaced sequence is compared to a
non-interlaced sequence.

A solution to this problem is to change the display model to
incorporate the known averaging in space and time that takes place
in the display itself. Such averaging renders the inter-line spaces
less visible. The first step is to define the magnitudes of these
effects to determine the appropriate model.

Temporal averaging occurs in the display because phos- 
phors have a finite decay time. So there will always be, e.g., 
a decaying remnant of the odd lines from field N-l at the 
time of primary emission from the even lines from field N. 
However, compared to the inter-field interval (16500 
microseconds), the phosphor decay times are typically quite 
short, e.g., 70 microseconds for the blue phosphor, 100 
microseconds for the green phosphor, and 700 microseconds 
for the red phosphor. Hence, temporal averaging in the 
display model does not contribute appreciably to inter-line 
smoothing. 

Spatial averaging occurs in the display because the emis- 
sion from a pixel spreads beyond the nominal pixel bound- 
ary. In interlaced displays, the electron -beam spot structure 
was designed conjointly with the interlace architecture. As a 
result, the pixel spread was engineered to be more pro- 
nounced in the vertical direction, so as to fill in the inter-line 
spaces and hence to make them less visible. The spread is 
particularly pronounced at high beam currents, which cor- 
respond to high luminance values and hence to the most 
noticeable parts of an image. Hence, from a display 
perspective, spatial averaging is a good physical model for 
inter-line smoothing. 

Alternatively, some temporal averaging can be used to effect
inter-line smoothing. The visual system itself would appear to
perform enough temporal averaging to render the inter-line spaces
invisible. However, as will be clear from the following discussion,
the lack of eye movements in the present perceptual metric generator
causes it to depart from the temporal-averaging behavior that should
otherwise be present.

It has been observed that human vision is subserved by 
mechanisms with two distinct classes of spatio-temporal 
responses: "sustained", with high spatial but low temporal 
resolution and "transient", with high temporal but low 
spatial resolution. 
One implementation of this perceptual metric generator uses
separable space/time filters to shape the responses of the two
channels. An immediate consequence of this modeling choice is a
temporal filter on the sustained channels that is quite lowpass in
time compared with the 60-Hz temporal sampling rate typical of a
display. Even the transient response is insensitive to the 60-Hz
sampling rate. However, one element that does not enter the
sustained/transient model is the effect of eye movements, and
particularly of the ability of the eye to track moving objects in an
image. This tracking enhances visual sensitivity to details in the
attended object, in a way that is not captured by perceptual metric
generator filters that are faithful to psychophysical experiments
with constrained stimuli.

The effect of motion on distortion measures in an image sequence can
be considerable. If the eye did not track objects moving in an
image, the blurring in the image that results from the sustained
temporal response would be accurately reflected in a perceptual
metric generator with much temporal averaging in one channel.
However, the eye does track moving objects, so the image is not
motion-blurred. Without the ability to track moving objects, a
perceptual metric generator purporting to quantify temporal visual
response should display motion blur. However, such blur hampers the
generation of an accurate JND map.

To resolve this difficulty without a tracking model, a 
compromise was made of representing the spatial channel 
(which acquires the role of the "sustained" channel in being 
sensitive to spatial detail) as operating on the last field, 

30 rather than on some time average of fields. As a result of this 
approach, the spatial channel reveals a well-focused JND 
map, as would be the case for an eye that tracked the motions 
of attended objects in an image sequence. 

In keeping with the spirit of the above compromise, one
could still relax the "specious-present" nature of the spatial
channel so that it averages over two fields, hence over one
frame. This measure would decrease the visibility of the
blank lines in an interlaced field, and is more physically and
physiologically plausible than the "specious-present"
solution. However, one artifact survives the temporal averaging
of two fields, and that is the appearance of a "comb" where
a smooth moving edge should be.

To understand why the comb appears in a model with
two-field averaging, it suffices to visualize an object
moving in the time interval between an even field (call it
field N) and an odd field (call it field N+1). Assume the
object has a vertical edge that moves 5 pixels horizontally
between fields. Also, suppose the object edge is at pixel n of
the even lines at field N. Then this edge will show up at pixel
n+5 of the odd lines at field N+1. If there is no "filling in"
between the raster lines of a particular field, then averaging
field N and field N+1 produces an edge that is no longer
vertical, but alternates between pixels n and n+5. This is the
"comb" effect.

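The comb effect described above can be reproduced numerically.
The following Python sketch is illustrative only and is not part
of the disclosed apparatus. The frame size, the helper names
make_field and edge_position, and the value n = 10 are
assumptions chosen for the example. It places a vertical edge at
pixel n on the even lines of field N and at pixel n+5 on the odd
lines of field N+1, averages the two fields with no filling-in,
and reports the edge position on each raster line.

```python
import numpy as np

def make_field(height, width, edge_col, parity):
    """Return a frame-sized array holding one interlaced field.

    Lines whose index has the given parity (0 = even, 1 = odd) contain a
    bright object covering every column left of edge_col. The remaining
    lines stay blank (zero), i.e. no filling-in is performed.
    """
    field = np.zeros((height, width))
    field[parity::2, :edge_col] = 1.0
    return field

def edge_position(row):
    """Column index of the first pixel below half of the row's peak value."""
    return int(np.argmax(row < 0.5 * row.max()))

n, shift = 10, 5                                    # hypothetical edge position and motion
field_n = make_field(8, 32, n, parity=0)            # even field N
field_n1 = make_field(8, 32, n + shift, parity=1)   # odd field N+1

averaged = 0.5 * (field_n + field_n1)               # two-field (one-frame) average
edges = [edge_position(row) for row in averaged]
print(edges)
```

With these assumed values the reported edge positions alternate
between 10 and 15 from one raster line to the next, which is the
comb.
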
To understand why the actual visual system does not see
this comb effect, imagine that the object is interesting
enough that the eye tracks it faithfully. That means the object
is stationary on the retina, because the retina anticipates the
motion of the object into the next field. If the edge of the
object is at pixel n of the even lines of field N, it will also
be at pixel n of the odd lines of field N+1, simply because
the eye's tracking of the object has been nearly perfect.

To avoid both the comb and other interlace artifacts, the
perceptual metric generator may perform a spatial filling-in
between the lines of each field in the display. This vertical
averaging avoids the comb effect because it provides a
rendition of the instantaneous spatial edge (which any
temporal averaging would not). Also, the vertical averaging
solves the original problem of the visibility of the interlace
line structure, in a way that is compatible with the known
spatial spread of the electron-beam spot structure.
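
One possible form of this filling-in is sketched below in Python.
It is a minimal illustration under stated assumptions, not the
implementation of the disclosed generator: the function name
fill_in_field, the two-neighbour mean, and the border handling
are all hypothetical choices. Each blank line of a single field is
replaced by the average of the data lines directly above and
below it, so the edge is rendered from one instant in time.

```python
import numpy as np

def fill_in_field(field, parity):
    """Fill the blank lines of one interlaced field by vertical averaging.

    field  : 2-D array in which only lines of the given parity carry data.
    parity : 0 if the even lines carry data, 1 if the odd lines do.
    Each blank line becomes the mean of the data lines directly above and
    below it. A blank line on the top or bottom border copies its single
    neighbour (an assumed border rule).
    """
    filled = field.astype(float).copy()
    height = field.shape[0]
    for y in range(1 - parity, height, 2):      # indices of the blank lines
        neighbours = [field[k] for k in (y - 1, y + 1) if 0 <= k < height]
        filled[y] = np.mean(neighbours, axis=0)
    return filled

# Odd field N+1 with a vertical edge at pixel 15 (hypothetical values).
field_n1 = np.zeros((8, 32))
field_n1[1::2, :15] = 1.0

filled = fill_in_field(field_n1, parity=1)
edges = [int(np.argmax(row < 0.5)) for row in filled]
print(edges)
```

In this example every line of the filled field reports the same
edge position, so the edge is vertical again and no blank lines
remain.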

There has thus been shown and described a novel method 
and apparatus for assessing the visibility of differences 
between two input image sequences for improving image 
fidelity and visual task applications. Many changes, 
modifications, variations and other uses and applications of 
the subject invention will, however, become apparent to 
those skilled in the art after considering this specification 
and the accompanying drawings which disclose the embodi- 
ments thereof. 

What is claimed is: 

1. An apparatus for assessing visibility of differences 
between two input image sequences, said apparatus com- 
prising: 

a luminance processing section; 

a chrominance processing section; 

a perceptual metric generating section, coupled to said 
processing sections, for generating an image metric; 

where said luminance processing section comprises a
downsampler for downsampling at least one of the two
input image sequences, an image field processor for 
receiving the output of the downsampler, a plurality of 
image field filters each receiving an output from the 
image field processor, a contrast computer for receiving 
outputs from the plurality of image field filters, and a 
non-linear processor for receiving an output from the 
contrast computer. 

2. The apparatus of claim 1 wherein the image field filters 
are spatial filters. 

3. The apparatus of claim 2 wherein the spatial filters are 
center and surround filters. 

4. The apparatus of claim 2 wherein the spatial filters 
comprise four spatial filters (CH, SH, CV, SV) for filtering 
information in two consecutive image fields that are center 
and surround filters comprising 3x3 matrices under the 
following constraints: 

where 

CH represents a filter kernel for performing center hori- 
zontal filtering, has all zeros in rows 1 and 3, and 
positive numbers in row 2 of a 3x3 matrix; 

SH represents a filter kernel for performing surround 
horizontal filtering, has all zeros in row 2, positive 
numbers in row 1, and row 3 the same as row 1 of a 3x3 
matrix; 

CV represents a filter kernel for performing center vertical 
filtering, is the transpose of CH of a 3x3 matrix; and 

SV represents a filter kernel for performing surround 
vertical filtering, is the transpose of SH of a 3x3 matrix. 

5. The apparatus of claim 4 wherein the contrast computer 
performs the following computations: 

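$$\frac{\mathrm{SH3}_i - \mathrm{CH3}_i - \mathrm{SH2}_i + \mathrm{CH2}_i}{w_{ST_i}\,(\mathrm{SH3}_i + \mathrm{CH3}_i + \mathrm{SH2}_i + \mathrm{CH2}_i)}$$

$$\frac{\mathrm{SV3}_i - \mathrm{CV3}_i - \mathrm{SV2}_i + \mathrm{CV2}_i}{w_{ST_i}\,(\mathrm{SV3}_i + \mathrm{CV3}_i + \mathrm{SV2}_i + \mathrm{CV2}_i)}$$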



where 

i is a pyramid level of a downsampled image; 2 and 3 refer 
to the consecutive image fields from which the down- 
sampled images are derived by filtering using kernels 
SH, CH, SV, CV, and wST is a calibration factor. 

6. A method of assessing visibility of differences between
two input image sequences, said method comprising:

downsampling an image sequence to produce downsampled
images having pyramid levels;

processing image field information from at least two
image fields within the downsampled images;

filtering said image field information using at least two
image field filters to produce filtered images;

computing contrast information regarding said filtered
images; and

processing said contrast information using a non-linear 
process. 

7. The method of claim 6 wherein the image field filters 
are spatial filters. 

8. The method of claim 6 wherein the image field filters
are center and surround filters.

9. The method of claim 7 wherein the spatial filters 
comprise four spatial filters (CH, SH, CV, SV) for filtering 
information in two consecutive image fields that are center 
and surround filters comprising 3x3 matrices under the 
following constraints: 

where 

CH represents a filter kernel for performing center hori- 
zontal filtering, has all zeros in rows 1 and 3, and 
positive numbers in row 2 of a 3x3 matrix; 

SH represents a filter kernel for performing surround 
horizontal filtering, has all zeros in row 2, positive 
numbers in row 1, and row 3 the same as row 1 of a 3x3 
matrix; 

CV represents a filter kernel for performing center vertical 
filtering, is the transpose of CH of a 3x3 matrix; and 

SV represents a filter kernel for performing surround 
vertical filtering, is the transpose of SH of a 3x3 matrix. 

10. The method of claim 8 wherein the computing step 
performs the following computations: 

$$\frac{\mathrm{SH3}_i - \mathrm{CH3}_i - \mathrm{SH2}_i + \mathrm{CH2}_i}{w_{ST_i}\,(\mathrm{SH3}_i + \mathrm{CH3}_i + \mathrm{SH2}_i + \mathrm{CH2}_i)}$$

$$\frac{\mathrm{SV3}_i - \mathrm{CV3}_i - \mathrm{SV2}_i + \mathrm{CV2}_i}{w_{ST_i}\,(\mathrm{SV3}_i + \mathrm{CV3}_i + \mathrm{SV2}_i + \mathrm{CV2}_i)}$$



where 

i is a pyramid level of a downsampled image; 2 and 3 refer
to the consecutive image fields from which the down-
sampled images are derived by filtering using kernels
SH, CH, SV, CV, and wST is a calibration factor.
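
By way of illustration only, and forming no part of the claims,
the kernel constraints of claims 4 and 9 and the computations of
claims 5 and 10 can be sketched in Python as follows. The kernel
weights, the value of wST, the border handling, the toy images,
and the names filter3x3 and st_contrast are assumptions made for
the example; the claims constrain only which kernel rows are zero
or positive.

```python
import numpy as np

# Hypothetical weights: only the zero/positive row layout comes from the claims.
CH = np.array([[0.0, 0.0, 0.0],
               [1.0, 2.0, 1.0],
               [0.0, 0.0, 0.0]]) / 4.0   # center horizontal
SH = np.array([[1.0, 2.0, 1.0],
               [0.0, 0.0, 0.0],
               [1.0, 2.0, 1.0]]) / 8.0   # surround horizontal
CV, SV = CH.T, SH.T                      # vertical kernels are the transposes

def filter3x3(image, kernel):
    """Correlate image with a 3x3 kernel, replicating border pixels (assumed)."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def st_contrast(field2, field3, center, surround, w_st):
    """(S3 - C3 - S2 + C2) / (w_st * (S3 + C3 + S2 + C2)) at one pyramid level.

    No guard against a zero denominator; inputs are assumed positive.
    """
    c2, s2 = filter3x3(field2, center), filter3x3(field2, surround)
    c3, s3 = filter3x3(field3, center), filter3x3(field3, surround)
    return (s3 - c3 - s2 + c2) / (w_st * (s3 + c3 + s2 + c2))

# Toy level-i images for two consecutive fields (positive luminance assumed).
rng = np.random.default_rng(0)
field2 = rng.uniform(1.0, 2.0, size=(6, 6))
field3 = 2.0 * field2                    # field 3 uniformly twice as bright

static = st_contrast(field2, field2, CH, SH, w_st=1.0)      # identical fields
changed_h = st_contrast(field2, field3, CH, SH, w_st=1.0)   # horizontal pair
changed_v = st_contrast(field2, field3, CV, SV, w_st=1.0)   # vertical pair
```

In this sketch identical consecutive fields give a zero result
everywhere, because the numerator cancels term by term, while a
change between fields gives a result normalized by the summed
center and surround responses.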